Bayesian Modeling for Simultaneous Regression and Record Linkage

  • Conference paper
Privacy in Statistical Databases (PSD 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Abstract

Often data analysts use probabilistic record linkage techniques to match records across two data sets. Such matching can be the primary goal, or it can be a necessary step to analyze relationships among the variables in the data sets. We propose a Bayesian hierarchical model that allows data analysts to perform simultaneous linear regression and probabilistic record linkage. This allows analysts to leverage relationships among the variables to improve linkage quality. Further, it enables analysts to propagate uncertainty in a principled way, while also potentially offering more accurate estimates of regression parameters compared to approaches that use a two-step process, i.e., link the records first, then estimate the linear regression on the linked data. We propose and evaluate three Markov chain Monte Carlo algorithms for implementing the Bayesian model, which we compare against a two-step process.

R. C. Steorts—This research was partially supported by the National Science Foundation through grants SES1131897, SES1733835, SES1652431 and SES1534412.

References

  1. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24(9), 1537–1555 (2012)

  2. Christen, P.: Data linkage: the big picture. Harv. Data Sci. Rev. 1(2) (2019)

  3. Dalzell, N.M., Reiter, J.P.: Regression modeling and file matching using possibly erroneous matching variables. J. Comput. Graph. Stat. 27, 728–738 (2018)

  4. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

  5. Fortini, M., Liseo, B., Nuccitelli, A., Scanu, M.: On Bayesian record linkage. Res. Off. Stat. 4, 185–198 (2001)

  6. Gutman, R., Afendulis, C.C., Zaslavsky, A.M.: A Bayesian procedure for file linking to analyze end-of-life medical costs. J. Am. Stat. Assoc. 108(501), 34–47 (2013)

  7. Hof, M.H., Ravelli, A.C., To, A.H.Z.: A probabilistic record linkage model for survival data. J. Am. Stat. Assoc. 112(520), 1504–1515 (2017)

  8. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84, 414–420 (1989)

  9. Larsen, M.D.: Iterative automated record linkage using mixture models. J. Am. Stat. Assoc. 96, 32–41 (2001)

  10. Larsen, M.D.: Comments on hierarchical Bayesian record linkage. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 1995–2000 (2002)

  11. Larsen, M.D.: Advances in record linkage theory: hierarchical Bayesian record linkage theory. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 3277–3284 (2005)

  12. Larsen, M.D., Rubin, D.B.: Iterative automated record linkage using mixture models. J. Am. Stat. Assoc. 96(453), 32–41 (2001)

  13. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Doklady 10, 707–710 (1965)

  14. Marchant, N.G., Steorts, R.C., Kaplan, A., Rubinstein, B.I., Elazar, D.N.: d-blink: distributed end-to-end Bayesian entity resolution (2019). arXiv preprint arXiv:1909.06039

  15. Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records: computers can be used to extract “follow-up” statistics of families from files of routine records. Science 130(3381), 954–959 (1959)

  16. Sadinle, M.: Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Ann. Appl. Stat. 8(4), 2404–2434 (2014). MR3292503

  17. Sadinle, M.: Bayesian estimation of bipartite matchings for record linkage. J. Am. Stat. Assoc. 112, 600–612 (2017)

  18. Sadinle, M., Fienberg, S.E.: A generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record systems. J. Am. Stat. Assoc. 108(502), 385–397 (2013)

  19. Steorts, R.C.: Entity resolution with empirically motivated priors. Bayesian Anal. 10(4), 849–875 (2015). MR3432242

  20. Steorts, R.C., Hall, R., Fienberg, S.E.: A Bayesian approach to graphical record linkage and deduplication. J. Am. Stat. Assoc. 111(516), 1660–1672 (2016)

  21. Tancredi, A., Liseo, B.: A hierarchical Bayesian approach to record linkage and population size problems. Ann. Appl. Stat. 5(2B), 1553–1585 (2011). MR2849786

  22. Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 354–359 (1990)

  23. Winkler, W.E.: Overview of record linkage and current research directions. Technical report. Statistics #2006-2, U.S. Bureau of the Census (2006)

  24. Winkler, W.E.: Matching and record linkage. Wiley Interdisc. Rev.: Comput. Stat. 6(5), 313–325 (2014)

  25. Zanella, G., Betancourt, B., Wallach, H., Miller, J., Zaidi, A., Steorts, R.C.: Flexible models for microclustering with application to entity resolution. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, NY, USA. Curran Associates Inc., pp. 1425–1433 (2016)

  26. Zhao, B., Rubinstein, B.I.P., Gemmell, J., Han, J.: A Bayesian approach to discovering truth from conflicting sources for data integration. Proc. VLDB Endow. 5(6), 550–561 (2012)


Author information

Corresponding author

Correspondence to Jiurui Tang.

Appendices

A Record Linkage Evaluation Metrics

Here, we review the definitions of the average numbers of correct links (CL), correct non-links (CNL), false negatives (FN), and false positives (FP). These allow one to calculate the false negative rate (FNR) and false discovery rate (FDR) [19]. For any MCMC iteration \(t\), we define CL\(^{[t]}\) as the number of record pairs that have \(Z_j \le n_1\) and are true links. We define CNL\(^{[t]}\) as the number of record pairs that have \(Z_j > n_1\) and are not true links. We define FN\(^{[t]}\) as the number of record pairs that are true links but have \(Z_j > n_1\). We define FP\(^{[t]}\) as the number of record pairs that are not true links but have \(Z_j \le n_1\). In the simulations, the number of true links is CL\(^{[t]}\) + FN\(^{[t]}\) \(= 750\), and the estimated number of links is CL\(^{[t]}\) + FP\(^{[t]}\). Thus, FNR\(^{[t]}\) \(=\) FN\(^{[t]}\)/(CL\(^{[t]}\) + FN\(^{[t]}\)) and FDR\(^{[t]}\) \(=\) FP\(^{[t]}\)/(CL\(^{[t]}\) + FP\(^{[t]}\)), where by convention we take FDR\(^{[t]} = 0\) when both the numerator and denominator are 0. We report the FDR instead of the FPR because an algorithm that does not link any records has a small FPR, yet this does not make it a good algorithm. Finally, for each metric, we compute the posterior mean across all MCMC iterations and then average these posterior means across all simulations.
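
The definitions above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that \(Z_j\) holds the 1-based index of the file-1 record linked to file-2 record \(j\) when \(Z_j \le n_1\), and any value above \(n_1\) when record \(j\) is left unlinked. The `truth` vector uses the same encoding for the true matches. The function names, this encoding, and the choice to return 0 for FNR when there are no true links are hypothetical conveniences.

```python
def linkage_counts(z, truth, n1):
    """Count CL, CNL, FN, FP for one MCMC draw z (assumed encoding, see lead-in)."""
    cl = cnl = fn = fp = 0
    for zj, tj in zip(z, truth):
        linked = zj <= n1        # record j is linked to some file-1 record
        has_match = tj <= n1     # record j truly has a match in file 1
        if linked and zj == tj:
            cl += 1              # linked, and the pair is a true link
        elif linked:
            fp += 1              # linked, but the pair is not a true link
        elif has_match:
            fn += 1              # true link exists, but Z_j > n1
        else:
            cnl += 1             # unlinked and no true link
    # FDR is 0 by convention when numerator and denominator are both 0.
    fdr = fp / (cl + fp) if (cl + fp) > 0 else 0.0
    # Assumption: FNR is also set to 0 when there are no true links to miss.
    fnr = fn / (cl + fn) if (cl + fn) > 0 else 0.0
    return {"CL": cl, "CNL": cnl, "FN": fn, "FP": fp, "FNR": fnr, "FDR": fdr}


def posterior_mean_metrics(z_draws, truth, n1):
    """Average each metric over all MCMC draws of Z."""
    per_draw = [linkage_counts(z, truth, n1) for z in z_draws]
    return {k: sum(d[k] for d in per_draw) / len(per_draw) for k in per_draw[0]}
```

Calling `posterior_mean_metrics` once per simulated data set and averaging the returned values over data sets would mirror the two-level averaging described above.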

B Additional Simulations with a Mis-specified Regression

As an additional simulation, we examine the performance of the hierarchical model in terms of linkage quality when we use a mis-specified regression. The true data generating model is \(\log (\mathbf {Y})|\mathbf {X},\mathbf {V},\mathbf {Z} \sim N(\mathbf {X\beta }, \sigma ^2 \mathbf {I})\), but we incorrectly assume \(\mathbf {Y}|\mathbf {X},\mathbf {V},\mathbf {Z} \sim N(\mathbf {X\beta }, \sigma ^2 \mathbf {I})\) in the hierarchical model. Table 3 summarizes the measures of linkage quality when the linkage variables have weak information. Even though the regression component of the hierarchical model is mis-specified, the hierarchical model still identifies more correct non-matches than the two-step approach identifies, although the difference is less obvious than when we use the correctly specified regression. We see a similar trend when the information in the linking variables is strong, albeit with smaller differences between the two-step approach and the hierarchical model.
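
To make the mis-specification concrete, the following sketch simulates a single covariate with \(\log(Y)\) linear in \(X\), then fits a simple linear regression once to \(\log(Y)\) (correctly specified) and once to \(Y\) (mis-specified). It illustrates only the regression component, not the record linkage model. The sample size, coefficients, error standard deviation, seed, and function names are hypothetical choices and are not taken from the paper's simulation design.

```python
import math
import random


def simulate(n=2000, beta0=0.5, beta1=1.0, sigma=0.5, seed=0):
    """Draw x ~ N(0, 1) and log(y) ~ N(beta0 + beta1 * x, sigma^2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    log_y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    y = [math.exp(v) for v in log_y]
    return x, y, log_y


def ols_fit(x, y):
    """Least-squares fit of y on x; returns (intercept, slope, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


x, y, log_y = simulate()
correct = ols_fit(x, log_y)   # regression on log(Y): matches the generating model
misspec = ols_fit(x, y)       # regression on Y: the mis-specified assumption
```

Under these illustrative settings, the fit on \(Y\) explains less of the variation than the fit on \(\log(Y)\), which is one way to see why a mis-specified regression component carries less information about which record pairs belong together.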

Table 3. Results for the simulation with a mis-specified regression component in the hierarchical model. Entries summarize the linkage quality across 100 simulation runs. Averages in the first four columns have standard errors less than 3. Averages in the last two columns have Monte Carlo standard errors less than 0.002.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Tang, J., Reiter, J.P., Steorts, R.C. (2020). Bayesian Modeling for Simultaneous Regression and Record Linkage. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_15

  • DOI: https://doi.org/10.1007/978-3-030-57521-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57520-5

  • Online ISBN: 978-3-030-57521-2

  • eBook Packages: Computer Science, Computer Science (R0)
