A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications

Bowen, Claire McKay; Bryant, Victoria; Burman, Leonard; Khitatrakun, Surachai; McClelland, Robert; Stallworth, Philip; Ueyama, Kyle; Williams, Aaron R.

doi:10.1007/978-3-030-57521-2_18

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

US government agencies possess data that could be invaluable for evaluating public policy, but these data often cannot be released publicly because of disclosure concerns. For instance, the Statistics of Income division (SOI) of the Internal Revenue Service releases an annual public use file of individual income tax returns that is invaluable to tax analysts in government agencies, nonprofit research organizations, and the private sector. However, SOI has taken increasingly aggressive measures to protect the data in the face of growing disclosure risks, such as a data intruder matching the anonymized public data with other public information available in nontax databases. In this paper, we describe our approach to generating a fully synthetic representation of the income tax data by using sequential Classification and Regression Trees and kernel density smoothing. This synthetic data file represents previously unreleased information useful for tax policy modeling. We also tested and evaluated the tradeoffs between data utility and disclosure risks of different parameterizations using a variety of validation metrics. The resulting synthetic data set has high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
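
As an illustration of the approach named in the abstract, the sketch below synthesizes variables one at a time. The first variable is drawn from a kernel-smoothed version of its marginal distribution. Each later variable is drawn from the leaf of a regression tree fit on the preceding variables, and Gaussian kernel noise is added to the sampled donor value so that no confidential value is copied exactly. This is a minimal sketch, not the authors' implementation: the function names (`fit_tree`, `leaf_values`, `smoothed_draw`, `synthesize`), the `min_leaf` parameter, the normal-reference bandwidth rule, and the toy variables are all assumptions made for illustration, and the tree is a bare-bones variance-reduction tree written with the Python standard library only.

```python
import random
import statistics


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, min_leaf=5):
    """Fit a small regression tree; leaves keep the observed y values."""
    n = len(y)
    best = None  # (score, feature index, threshold)
    if n >= 2 * min_leaf:
        for j in range(len(X[0])):
            order = sorted(range(n), key=lambda i: X[i][j])
            for k in range(min_leaf, n - min_leaf + 1):
                lo, hi = X[order[k - 1]][j], X[order[k]][j]
                if lo == hi:
                    continue  # cannot split between tied values
                score = (sse([y[i] for i in order[:k]])
                         + sse([y[i] for i in order[k:]]))
                if best is None or score < best[0]:
                    best = (score, j, lo)
    if best is None or best[0] >= sse(y):
        return {"leaf": list(y)}
    _, j, threshold = best
    left = [i for i in range(n) if X[i][j] <= threshold]
    right = [i for i in range(n) if X[i][j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left], min_leaf),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right], min_leaf),
    }


def leaf_values(tree, x):
    """Return the observed values stored in the leaf that x falls into."""
    while "leaf" not in tree:
        side = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]


def smoothed_draw(values, rng):
    """Draw from a Gaussian kernel density estimate of the given values."""
    donor = rng.choice(values)
    if len(values) < 2:
        return donor
    # Assumed bandwidth: normal-reference rule of thumb.
    bandwidth = 1.06 * statistics.stdev(values) * len(values) ** -0.2
    return rng.gauss(donor, bandwidth)


def synthesize(data, n_synth, seed=0, min_leaf=5):
    """data: rows of floats, with columns already in synthesis order."""
    rng = random.Random(seed)
    first = [row[0] for row in data]
    synth = [[smoothed_draw(first, rng)] for _ in range(n_synth)]
    for j in range(1, len(data[0])):
        # Fit on the confidential data, predict from the synthetic predictors.
        tree = fit_tree([row[:j] for row in data], [row[j] for row in data], min_leaf)
        for record in synth:
            record.append(smoothed_draw(leaf_values(tree, record), rng))
    return synth


if __name__ == "__main__":
    toy_rng = random.Random(1)
    # Hypothetical two-variable file: age, then wages.
    toy = []
    for _ in range(200):
        age = toy_rng.uniform(20, 70)
        toy.append([age, 1000 * age + toy_rng.gauss(0, 5000)])
    synthetic = synthesize(toy, n_synth=200)
    print(len(synthetic), "synthetic records with", len(synthetic[0]), "variables")
```

Because each tree is fit on the confidential data but evaluated on already-synthesized predictors, every record in the output is generated rather than copied, which is what makes the file fully synthetic. The split search here is quadratic in the number of records, so this sketch is only suitable for small examples.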

Acknowledgments

We are grateful to Joe Ansaldi, Don Boyd, Jim Cilke, John Czajka, Rick Evans, Dan Feenberg, Barry Johnson, Julia Lane, Graham MacDonald, Shannon Mok, Jim Nunns, James Pearce, Kevin Pierce, Alan Plumley, Daniel Silva-Inclan, Michael Strudler, Lars Vilhuber, Mike Weber, and Doug Wissoker for helpful comments and discussions.

This research is supported by a grant from Arnold Ventures. The findings and conclusions are those of the authors and do not necessarily reflect positions or policies of the Tax Policy Center or its funders.

Author information

Authors and Affiliations

Urban Institute, Washington D.C., 20024, USA
Claire McKay Bowen, Leonard Burman, Surachai Khitatrakun, Robert McClelland, Kyle Ueyama & Aaron R. Williams
Internal Revenue Service, Washington D.C., 20002, USA
Victoria Bryant
Syracuse University, Syracuse, NY, 13244, USA
Leonard Burman
University of Michigan, Ann Arbor, MI, 48109, USA
Philip Stallworth

Corresponding author

Correspondence to Claire McKay Bowen.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
University of Oklahoma, Norman, OK, USA
Krishnamurty Muralidhar

Appendix

The appendix contains the supplementary materials to accompany the paper “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications” with additional results from the utility evaluation.

Figures 2 and 3 show the means and standard deviations, respectively, from the original and synthetic supplemental PUF data for age and all 17 tax variables. Figure 4 displays, for all 17 tax variables, the pairwise correlations in the synthetic data minus the corresponding correlations in the original data.
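
The comparisons described above can be sketched as follows. This is an assumed, simplified version of such a utility check, not the code behind Figures 2 through 4. The function names and the column names used in the example (`wages`, `interest`) are hypothetical, and each file is represented as a dictionary mapping a variable name to a list of values.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def summary_comparison(original, synthetic):
    """Mean and standard deviation of each variable in both files."""
    return {
        name: {
            "mean_original": statistics.fmean(original[name]),
            "mean_synthetic": statistics.fmean(synthetic[name]),
            "sd_original": statistics.stdev(original[name]),
            "sd_synthetic": statistics.stdev(synthetic[name]),
        }
        for name in original
    }


def correlation_difference(original, synthetic):
    """Synthetic-minus-original correlation for every pair of variables."""
    names = list(original)
    return {
        (a, b): pearson(synthetic[a], synthetic[b]) - pearson(original[a], original[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical toy values, not data from the paper.
    original = {"wages": [10.0, 20.0, 30.0, 45.0], "interest": [1.0, 0.0, 3.0, 4.0]}
    synthetic = {"wages": [12.0, 18.0, 33.0, 41.0], "interest": [0.5, 0.2, 2.5, 4.4]}
    print(summary_comparison(original, synthetic))
    print(correlation_difference(original, synthetic))
```

Differences near zero indicate that the synthetic file preserves the linear relationship between a pair of variables. Large positive or negative values flag pairs whose relationship the synthesis strengthened or weakened.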

About this paper

Cite this paper

Bowen, C.M. et al. (2020). A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_18

DOI: https://doi.org/10.1007/978-3-030-57521-2_18
Published: 16 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)