Synthetic Individual Income Tax Data: Methodology, Utility, and Privacy Implications

Bowen, Claire McKay; Bryant, Victoria; Burman, Leonard; Czajka, John; Khitatrakun, Surachai; MacDonald, Graham; McClelland, Robert; Mucciolo, Livia; Pickens, Madeline; Ueyama, Kyle; Williams, Aaron R.; Wissoker, Doug; Zwiefel, Noah

doi:10.1007/978-3-031-13945-1_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13463)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

The United States Internal Revenue Service Statistics of Income (SOI) Division possesses invaluable administrative tax data from individual income tax returns that could vastly expand our understanding of how tax policies affect behavior and how those policies could be made more effective. However, only a small number of government analysts and researchers can access the raw data. The public use file (PUF) that SOI has produced for more than 60 years has become increasingly difficult to protect using traditional statistical disclosure control methods. The vast amount of personal information available in public and private databases, combined with enormous computational power, creates unprecedented disclosure risks. SOI and researchers at the Urban Institute are developing synthetic data that represent the statistical properties of the administrative data without revealing any individual taxpayer information. This paper presents quality estimates of the first fully synthetic PUF and shows how it performs in tax model microsimulations as compared with the PUF and the confidential administrative data.

Notes

1. See https://pslmodels.org/.

Acknowledgments

The projects outlined in this paper relied on the analytical capability that was made possible in part by a grant from Arnold Ventures. The findings and conclusions are those of the authors and do not necessarily reflect positions or policies of the Internal Revenue Service, the Urban Institute, or its funders.

Author information

Authors and Affiliations

Urban Institute, Washington, D.C., 20024, USA
Claire McKay Bowen, Leonard Burman, Surachai Khitatrakun, Graham MacDonald, Robert McClelland, Livia Mucciolo, Madeline Pickens, Aaron R. Williams & Doug Wissoker
Internal Revenue Service, Washington, D.C., 20002, USA
Victoria Bryant
Bethesda, MD, 20816, USA
John Czajka
Coiled, New York, NY, 10018, USA
Kyle Ueyama
University College London, London, UK
Noah Zwiefel

Corresponding author

Correspondence to Claire McKay Bowen.

Editor information

Editors and Affiliations

Universitat Rovira i Virgili, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
Télécom SudParis, Palaiseau, France
Maryline Laurent

Appendix

Table 1. Number of duplicate records out of the possible 265,239 records.

Table 2. l-diversity results within each variable.

Table 3. l-diversity results across observations.

About this paper

Cite this paper

Bowen, C.M. et al. (2022). Synthetic Individual Income Tax Data: Methodology, Utility, and Privacy Implications. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_14

DOI: https://doi.org/10.1007/978-3-031-13945-1_14
Published: 14 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13944-4
Online ISBN: 978-3-031-13945-1
eBook Packages: Computer Science, Computer Science (R0)
