Abstract
US government agencies possess data that could be invaluable for evaluating public policy but that often cannot be released publicly because of disclosure concerns. For instance, the Statistics of Income division (SOI) of the Internal Revenue Service releases an annual public use file of individual income tax returns that is widely used by tax analysts in government agencies, nonprofit research organizations, and the private sector. However, SOI has taken increasingly aggressive measures to protect the data in the face of growing disclosure risks, such as a data intruder matching the anonymized public data with public information available in nontax databases. In this paper, we describe our approach to generating a fully synthetic representation of the income tax data using sequential Classification and Regression Trees and kernel density smoothing. This synthetic data file represents previously unreleased information useful for tax policy modeling. We also test and evaluate the tradeoffs between data utility and disclosure risk for different parameterizations using a variety of validation metrics. The resulting synthetic data set has high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
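To make the idea of sequential tree-based synthesis with kernel smoothing concrete, the following is a minimal sketch and not the authors' implementation. It assumes numeric variables only, a small hand-written regression tree standing in for a full CART library, and a rule-of-thumb Gaussian bandwidth based on the standard deviation; the function names (fit_tree, leaf_values, smooth, synthesize) and all default parameters are hypothetical. The first variable is drawn from its smoothed marginal distribution, and each later variable is drawn from the tree leaf selected by the values already synthesized.

```python
import random
import statistics


def sum_sq(values):
    """Sum of squared deviations from the mean (node impurity)."""
    m = statistics.fmean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(X, y, min_leaf=5, max_depth=4, depth=0):
    """Fit a small regression tree; each leaf keeps its observed y values."""
    n = len(y)
    best = None  # (impurity, feature index, threshold)
    if depth < max_depth and n >= 2 * min_leaf:
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X})[:-1]:
                left = [y[i] for i in range(n) if X[i][j] <= t]
                right = [y[i] for i in range(n) if X[i][j] > t]
                if min(len(left), len(right)) < min_leaf:
                    continue
                impurity = sum_sq(left) + sum_sq(right)
                if best is None or impurity < best[0]:
                    best = (impurity, j, t)
    if best is None:
        return {"leaf": list(y)}
    _, j, t = best
    lo = [i for i in range(n) if X[i][j] <= t]
    hi = [i for i in range(n) if X[i][j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([X[i] for i in lo], [y[i] for i in lo],
                         min_leaf, max_depth, depth + 1),
        "right": fit_tree([X[i] for i in hi], [y[i] for i in hi],
                          min_leaf, max_depth, depth + 1),
    }


def leaf_values(tree, x):
    """Return the donor values in the leaf that record x falls into."""
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


def smooth(value, donors, rng):
    """Perturb a sampled donor value with Gaussian kernel noise."""
    if len(donors) < 2:
        return value
    sd = statistics.stdev(donors)
    # Assumed rule-of-thumb bandwidth (standard deviation only).
    bandwidth = 0.9 * sd * len(donors) ** -0.2
    return rng.gauss(value, bandwidth) if bandwidth > 0 else value


def synthesize(data, n_synth, seed=0):
    """Synthesize rows column by column, in the column order of `data`."""
    rng = random.Random(seed)
    first = [row[0] for row in data]
    # First variable: resample the marginal distribution, then smooth.
    synth = [[smooth(rng.choice(first), first, rng)] for _ in range(n_synth)]
    for k in range(1, len(data[0])):
        # Later variables: tree fit on the original data, predictors = columns 0..k-1.
        tree = fit_tree([row[:k] for row in data], [row[k] for row in data])
        for record in synth:
            donors = leaf_values(tree, record)
            record.append(smooth(rng.choice(donors), donors, rng))
    return synth
```

In this sketch no synthetic value is copied directly from a confidential record whenever a leaf holds at least two distinct donor values, because the sampled donor is perturbed by kernel noise. The synthesis order, the minimum leaf size, and the bandwidth are the kinds of parameters whose utility and risk tradeoffs would need to be evaluated.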
Acknowledgments
We are grateful to Joe Ansaldi, Don Boyd, Jim Cilke, John Czajka, Rick Evans, Dan Feenberg, Barry Johnson, Julia Lane, Graham MacDonald, Shannon Mok, Jim Nunns, James Pearce, Kevin Pierce, Alan Plumley, Daniel Silva-Inclan, Michael Strudler, Lars Vilhuber, Mike Weber, and Doug Wissoker for helpful comments and discussions.
This research is supported by a grant from Arnold Ventures. The findings and conclusions are those of the authors and do not necessarily reflect positions or policies of the Tax Policy Center or its funders.
Appendix
The appendix contains the supplementary materials to accompany the paper “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications” with additional results from the utility evaluation.
Figures 2 and 3 show the means and standard deviations, respectively, of age and all 17 tax variables in the original and synthetic supplemental PUF data. Figure 4 displays, for all 17 tax variables, the correlations in the synthetic data minus the correlations in the original data.
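The comparisons behind these figures can be sketched as follows. This is an illustrative sketch rather than the code used for the paper: it assumes each data set is held as a dictionary mapping a variable name to a list of numeric values, and the names pearson and compare_utility, along with the example variable names, are hypothetical.

```python
import math
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def compare_utility(original, synthetic, names):
    """Return per-variable means and standard deviations, plus pairwise
    correlation differences (synthetic minus original)."""
    summary = {}
    for name in names:
        summary[name] = {
            "mean_orig": statistics.fmean(original[name]),
            "mean_synth": statistics.fmean(synthetic[name]),
            "sd_orig": statistics.stdev(original[name]),
            "sd_synth": statistics.stdev(synthetic[name]),
        }
    corr_diff = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            corr_diff[(a, b)] = (pearson(synthetic[a], synthetic[b])
                                 - pearson(original[a], original[b]))
    return summary, corr_diff


# Toy values for illustration only; "age" and "wages" are placeholder names.
original = {"age": [1, 2, 3, 4], "wages": [2, 4, 6, 8]}
synthetic = {"age": [1, 2, 3, 4], "wages": [8, 6, 4, 2]}
summary, corr_diff = compare_utility(original, synthetic, ["age", "wages"])
```

A correlation difference near zero indicates that the synthetic data preserve the pairwise relationship. In the toy values above the relationship is deliberately reversed, so the difference is at its most negative.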
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Bowen, C.M. et al. (2020). A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_18
DOI: https://doi.org/10.1007/978-3-030-57521-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)