Abstract
The United States Internal Revenue Service Statistics of Income (SOI) Division possesses invaluable administrative tax data from individual income tax returns that could vastly expand our understanding of how tax policies affect behavior and how those policies could be made more effective. However, only a small number of government analysts and researchers can access the raw data. The public use file (PUF) that SOI has produced for more than 60 years has become increasingly difficult to protect using traditional statistical disclosure control methods. The vast amount of personal information available in public and private databases, combined with enormous computational power, creates unprecedented disclosure risks. SOI and researchers at the Urban Institute are developing synthetic data that represent the statistical properties of the administrative data without revealing any individual taxpayer information. This paper presents quality estimates of the first fully synthetic PUF and shows how it performs in tax model microsimulations compared with the PUF and the confidential administrative data.
Acknowledgments
The projects outlined in this paper relied on the analytical capability that was made possible in part by a grant from Arnold Ventures. The findings and conclusions are those of the authors and do not necessarily reflect positions or policies of the Internal Revenue Service, the Urban Institute, or its funders.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Bowen, C.M. et al. (2022). Synthetic Individual Income Tax Data: Methodology, Utility, and Privacy Implications. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_14
DOI: https://doi.org/10.1007/978-3-031-13945-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13944-4
Online ISBN: 978-3-031-13945-1
eBook Packages: Computer Science, Computer Science (R0)