Bayesian Modeling for Simultaneous Regression and Record Linkage

Tang, Jiurui; Reiter, Jerome P.; Steorts, Rebecca C.

doi:10.1007/978-3-030-57521-2_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

Often data analysts use probabilistic record linkage techniques to match records across two data sets. Such matching can be the primary goal, or it can be a necessary step to analyze relationships among the variables in the data sets. We propose a Bayesian hierarchical model that allows data analysts to perform simultaneous linear regression and probabilistic record linkage. This allows analysts to leverage relationships among the variables to improve linkage quality. Further, it enables analysts to propagate uncertainty in a principled way, while also potentially offering more accurate estimates of regression parameters compared to approaches that use a two-step process, i.e., link the records first, then estimate the linear regression on the linked data. We propose and evaluate three Markov chain Monte Carlo algorithms for implementing the Bayesian model, which we compare against a two-step process.

R. C. Steorts—This research was partially supported by the National Science Foundation through grants SES1131897, SES1733835, SES1652431 and SES1534412.

References

1. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24(9), 1537–1555 (2012)
2. Christen, P.: Data linkage: the big picture. Harv. Data Sci. Rev. 1(2) (2019)
3. Dalzell, N.M., Reiter, J.P.: Regression modeling and file matching using possibly erroneous matching variables. J. Comput. Graph. Stat. 27, 728–738 (2018)
4. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
5. Fortini, M., Liseo, B., Nuccitelli, A., Scanu, M.: On Bayesian record linkage. Res. Off. Stat. 4, 185–198 (2001)
6. Gutman, R., Afendulis, C.C., Zaslavsky, A.M.: A Bayesian procedure for file linking to analyze end-of-life medical costs. J. Am. Stat. Assoc. 108(501), 34–47 (2013)
7. Hof, M.H., Ravelli, A.C., To, A.H.Z.: A probabilistic record linkage model for survival data. J. Am. Stat. Assoc. 112(520), 1504–1515 (2017)
8. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84, 414–420 (1989)
9. Larsen, M.D.: Iterative automated record linkage using mixture models. J. Am. Stat. Assoc. 96, 32–41 (2001)
10. Larsen, M.D.: Comments on hierarchical Bayesian record linkage. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 1995–2000 (2002)
11. Larsen, M.D.: Advances in record linkage theory: hierarchical Bayesian record linkage theory. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 3277–3284 (2005)
12. Larsen, M.D., Rubin, D.B.: Iterative automated record linkage using mixture models. J. Am. Stat. Assoc. 96(453), 32–41 (2001)
13. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Doklady 10, 707–710 (1965)
14. Marchant, N.G., Steorts, R.C., Kaplan, A., Rubinstein, B.I., Elazar, D.N.: d-blink: distributed end-to-end Bayesian entity resolution (2019). arXiv preprint arXiv:1909.06039
15. Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records: computers can be used to extract “follow-up” statistics of families from files of routine records. Science 130(3381), 954–959 (1959)
16. Sadinle, M.: Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Ann. Appl. Stat. 8(4), 2404–2434 (2014). MR3292503
17. Sadinle, M.: Bayesian estimation of bipartite matchings for record linkage. J. Am. Stat. Assoc. 112, 600–612 (2017)
18. Sadinle, M., Fienberg, S.E.: A generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record systems. J. Am. Stat. Assoc. 108(502), 385–397 (2013)
19. Steorts, R.C.: Entity resolution with empirically motivated priors. Bayesian Anal. 10(4), 849–875 (2015). MR3432242
20. Steorts, R.C., Hall, R., Fienberg, S.E.: A Bayesian approach to graphical record linkage and deduplication. J. Am. Stat. Assoc. 111(516), 1660–1672 (2016)
21. Tancredi, A., Liseo, B.: A hierarchical Bayesian approach to record linkage and population size problems. Ann. Appl. Stat. 5(2B), 1553–1585 (2011). MR2849786
22. Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 354–359 (1990)
23. Winkler, W.E.: Overview of record linkage and current research directions. Technical report. Statistics #2006-2, U.S. Bureau of the Census (2006)
24. Winkler, W.E.: Matching and record linkage. Wiley Interdisc. Rev.: Comput. Stat. 6(5), 313–325 (2014)
25. Zanella, G., Betancourt, B., Wallach, H., Miller, J., Zaidi, A., Steorts, R.C.: Flexible models for microclustering with application to entity resolution. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, NY, USA. Curran Associates Inc., pp. 1425–1433 (2016)
26. Zhao, B., Rubinstein, B.I.P., Gemmell, J., Han, J.: A Bayesian approach to discovering truth from conflicting sources for data integration. Proc. VLDB Endow. 5(6), 550–561 (2012)

Author information

Authors and Affiliations

Department of Statistical Science, Duke University, Durham, USA
Jiurui Tang, Jerome P. Reiter & Rebecca C. Steorts

Corresponding author

Correspondence to Jiurui Tang.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
University of Oklahoma, Norman, OK, USA
Krishnamurty Muralidhar

Appendices

A Record Linkage Evaluation Metrics

Here, we review the definitions of the average numbers of correct links (CL), correct non-links (CNL), false negatives (FN), and false positives (FP). These allow one to calculate the false negative rate (FNR) and false discovery rate (FDR) [19]. For any MCMC iteration t, we define CL\(^{[t]}\) as the number of record pairs with \(Z_j \le n_1\) that are true links. We define CNL\(^{[t]}\) as the number of record pairs with \(Z_j > n_1\) that are not true links. We define FN\(^{[t]}\) as the number of record pairs that are true links but have \(Z_j > n_1\). We define FP\(^{[t]}\) as the number of record pairs that are not true links but have \(Z_j \le n_1\). In the simulations, the number of true links is CL\(^{[t]}\)+FN\(^{[t]}=750\), and the estimated number of links is CL\(^{[t]}\)+FP\(^{[t]}\). Thus, FNR\(^{[t]} = \) FN\(^{[t]}\)/(CL\(^{[t]}\)+FN\(^{[t]}\)) and FDR\(^{[t]} = \) FP\(^{[t]}\)/(CL\(^{[t]}\)+FP\(^{[t]}\)), where by convention we take FDR\(^{[t]} = 0\) when both the numerator and denominator are 0. We report the FDR instead of the false positive rate (FPR), as an algorithm that does not link any records has a small FPR, but this does not mean that it is a good algorithm. Finally, for each metric, we compute the posterior mean across all MCMC iterations, and we average these posterior means across all simulations.
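The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: it assumes each draw is a vector with one entry \(Z_j\) per record in the second file, that \(Z_j \le n_1\) names the linked record in the first file and \(Z_j > n_1\) means no link, and that the truth is encoded the same way. The function names `linkage_metrics` and `posterior_mean_metrics` are hypothetical, and returning FNR = 0 when CL + FN = 0 is an added assumption (the text states the zero convention only for the FDR).

```python
def linkage_metrics(z, z_true, n1):
    """Per-iteration CL, CNL, FN, FP, FNR, FDR for one draw of Z.

    z, z_true: sequences with one entry per record in file 2 (assumed encoding:
    a value <= n1 is the index of the linked file-1 record, a value > n1 means
    the record is not linked).
    """
    cl = cnl = fn = fp = 0
    for zj, tj in zip(z, z_true):
        linked = zj <= n1
        truly_linked = tj <= n1
        if linked and zj == tj:
            cl += 1          # declared link that is a true link
        elif linked:
            fp += 1          # declared link that is not a true link
        elif truly_linked:
            fn += 1          # true link that was left unlinked
        else:
            cnl += 1         # unlinked record with no true link
    # FDR = 0 when numerator and denominator are both 0 (convention in the text).
    fdr = fp / (cl + fp) if (cl + fp) > 0 else 0.0
    # Assumption: apply the same zero convention to the FNR.
    fnr = fn / (cl + fn) if (cl + fn) > 0 else 0.0
    return {"CL": cl, "CNL": cnl, "FN": fn, "FP": fp, "FNR": fnr, "FDR": fdr}


def posterior_mean_metrics(draws, z_true, n1):
    """Average each metric over MCMC draws of Z (one sequence per iteration)."""
    per_iter = [linkage_metrics(z, z_true, n1) for z in draws]
    return {k: sum(m[k] for m in per_iter) / len(per_iter) for k in per_iter[0]}
```

For a single simulated data set, `posterior_mean_metrics` would be applied to the retained draws of \(\mathbf{Z}\); averaging its output over simulation runs gives the kind of summary described above.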

B Additional Simulations with a Mis-specified Regression

As an additional simulation, we examine the performance of the hierarchical model in terms of linkage quality when we use a mis-specified regression. The true data generating model is \(\log (\mathbf {Y})|\mathbf {X},\mathbf {V},\mathbf {Z} \sim N(\mathbf {X\beta }, \sigma ^2 \mathbf {I})\), but we incorrectly assume \(\mathbf {Y}|\mathbf {X},\mathbf {V},\mathbf {Z} \sim N(\mathbf {X\beta }, \sigma ^2 \mathbf {I})\) in the hierarchical model. Table 3 summarizes the measures of linkage quality when the linkage variables have weak information. Even though the regression component of the hierarchical model is mis-specified, the hierarchical model still identifies more correct non-matches than the two-step approach identifies, although the difference is less obvious than when we use the correctly specified regression. We see a similar trend when the information in the linking variables is strong, albeit with smaller differences between the two-step approach and the hierarchical model.
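To make the form of the mis-specification concrete, the following sketch generates a response whose logarithm is linear in a single covariate and then fits a simple linear regression both to \(\log(Y)\) (correctly specified) and to \(Y\) (mis-specified). It does not implement the hierarchical model or any record linkage step. The single standard normal covariate, the values of the intercept, slope, \(\sigma\), and sample size, and the function names are all assumptions chosen for illustration, not the settings used in the paper.

```python
import math
import random


def simulate_lognormal_response(n, beta0, beta1, sigma, seed=0):
    """Draw x ~ N(0, 1) and y with log(y) ~ N(beta0 + beta1 * x, sigma^2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [math.exp(beta0 + beta1 * xi + rng.gauss(0.0, sigma)) for xi in x]
    return x, y


def simple_ols(x, y):
    """Least-squares fit of y on x with an intercept; returns (intercept, slope, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Illustrative values only (assumptions, not the paper's simulation design).
x, y = simulate_lognormal_response(n=2000, beta0=1.0, beta1=1.0, sigma=0.5)
fit_correct = simple_ols(x, [math.log(v) for v in y])  # model for log(Y)
fit_misspecified = simple_ols(x, y)                    # model for Y itself
print("log(Y) fit (intercept, slope, R^2):", fit_correct)
print("Y fit      (intercept, slope, R^2):", fit_misspecified)
```

Under these assumed settings the fit on the log scale recovers the generating slope closely, while the fit to \(Y\) explains less of the variation, which is one way to see why a mis-specified regression component carries less information for linkage.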

Table 3. Results for simulation with mis-specified regression component in the hierarchical model. Entries summarize the linkage quality across 100 simulation runs. Averages in first four columns have standard errors less than 3. Averages in the last two columns have Monte Carlo standard errors less than .002.

Cite this paper

Tang, J., Reiter, J.P., Steorts, R.C. (2020). Bayesian Modeling for Simultaneous Regression and Record Linkage. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_15

Published: 16 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)