Collective entity resolution in multi-relational familial networks

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Entity resolution in settings with rich relational structure often introduces complex dependencies between co-references. Exploiting these dependencies is challenging—it requires seamlessly combining statistical, relational, and logical dependencies. One task of particular interest is entity resolution in familial networks. In this setting, multiple partial representations of a family tree are provided, from the perspective of different family members, and the challenge is to reconstruct a family tree from these multiple, noisy, partial views. This reconstruction is crucial for applications such as understanding genetic inheritance, tracking disease contagion, and performing census surveys. Here, we design a model that incorporates statistical signals (such as name similarity), relational information (such as sibling overlap), logical constraints (such as transitivity and bijective matching), and predictions from other algorithms (such as logistic regression and support vector machines), in a collective model. We show how to integrate these features using probabilistic soft logic, a scalable probabilistic programming framework. In experiments on real-world data, our model significantly outperforms state-of-the-art classifiers that use relational features but are incapable of collective reasoning.
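To make the idea of combining soft signals in probabilistic soft logic concrete, here is a minimal sketch of how a single weighted rule can be scored. It assumes the standard Lukasiewicz relaxation used by hinge-loss Markov random fields, where truth values lie in [0, 1] and a rule `body -> head` is penalized by how far the body's truth exceeds the head's. The rule, the weight, and the similarity values below are hypothetical and are not taken from the paper.

```python
def lukasiewicz_and(*truth_values):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(truth_values) - (len(truth_values) - 1))


def rule_penalty(weight, body, head):
    """Weighted linear hinge penalty for the soft rule body -> head.

    The penalty is zero when the head is at least as true as the body.
    """
    return weight * max(0.0, lukasiewicz_and(*body) - head)


# Hypothetical rule (an assumption, not the paper's rule):
#   SimilarFirstName(a, b) & SimilarLastName(a, b) -> Same(a, b)
first_name_sim = 0.95   # assumed name-similarity score
last_name_sim = 0.90    # assumed name-similarity score
same_ab = 0.40          # current soft value of Same(a, b)

penalty = rule_penalty(2.0, [first_name_sim, last_name_sim], same_ab)
print(round(penalty, 3))  # about 0.9
```

Inference in such a model searches for the assignment of `Same` values that minimizes the total penalty over all grounded rules, which is why strong name similarity pushes `same_ab` upward here.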

Notes

  1. Code and data available at: https://github.com/pkouki/icdm2017.

  2. https://www.wikidata.org/.

  3. https://www.wikidata.org/wiki/Q76.

  4. Available at: https://github.com/fracpete/collective-classification-weka-package.

  5. ancestry.com, genealogy.com, familysearch.org, 23andMe.com.

Acknowledgements

We would like to thank Peter Christen and Jon Berry for insightful comments on this paper. This work was partially supported by the National Science Foundation Grants IIS-1218488, CCF-1740850, and IIS-1703331 and by the National Human Genome Research Institute Division of Intramural Research at the National Institutes of Health (ZIA HG2000397 and ZIA HG200395, Koehly PI). We would also like to thank the Sandia LDRD (Laboratory-Directed Research and Development) program for support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the National Institutes of Health, or the Sandia Labs.

Author information

Corresponding author

Correspondence to Pigi Kouki.

APPENDIX: PSL model rules

Name similarity rules

figure g

Personal information similarity rules

figure h

Relational similarity rules of 1st degree

figure i

Relational similarity rules of 2nd degree

figure j

Transitive relational (similarity) rules of 1st degree

figure k

Transitive relational (similarity) rules of 2nd degree

figure l

Bijection and transitivity rules

figure m
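As a rough illustration of what constraints in this category can look like, the sketch below scores violations of transitivity and of one-to-one matching under the Lukasiewicz relaxation. It is not the authors' rule listing: the rule forms, the function names, and the truth values are assumptions chosen for illustration.

```python
def lukasiewicz_and(*truth_values):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(truth_values) - (len(truth_values) - 1))


def transitivity_penalty(same_ab, same_bc, same_ac):
    """Violation of the assumed rule Same(a,b) & Same(b,c) -> Same(a,c)."""
    return max(0.0, lukasiewicz_and(same_ab, same_bc) - same_ac)


def bijection_penalty(same_ab, same_ac):
    """Violation of the assumed rule Same(a,b) -> !Same(a,c).

    Here b and c are taken to be distinct people in the same report,
    so a should match at most one of them.
    """
    return max(0.0, same_ab - (1.0 - same_ac))


# Hypothetical soft match values.
t = transitivity_penalty(0.9, 0.8, 0.3)
b = bijection_penalty(0.9, 0.8)
print(round(t, 3), round(b, 3))  # about 0.4 and 0.7
```

Both penalties vanish when the match values are mutually consistent, for example when `same_ac` is raised to 0.7 in the transitivity case or lowered to 0.1 in the bijection case.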

Rules to leverage existing classification algorithms

figure n

Prior rule

figure o

Cite this article

Kouki, P., Pujara, J., Marcum, C. et al. Collective entity resolution in multi-relational familial networks. Knowl Inf Syst 61, 1547–1581 (2019). https://doi.org/10.1007/s10115-018-1246-2

  • DOI: https://doi.org/10.1007/s10115-018-1246-2