Synthetic Individual Income Tax Data: Methodology, Utility, and Privacy Implications

  • Conference paper
Privacy in Statistical Databases (PSD 2022)

Abstract

The United States Internal Revenue Service Statistics of Income (SOI) Division possesses invaluable administrative tax data from individual income tax returns that could vastly expand our understanding of how tax policies affect behavior and how those policies could be made more effective. However, only a small number of government analysts and researchers can access the raw data. The public use file (PUF) that SOI has produced for more than 60 years has become increasingly difficult to protect using traditional statistical disclosure control methods. The vast amount of personal information available in public and private databases, combined with enormous computational power, creates unprecedented disclosure risks. SOI and researchers at the Urban Institute are developing synthetic data that represent the statistical properties of the administrative data without revealing any individual taxpayer information. This paper presents quality estimates of the first fully synthetic PUF and shows how it performs in tax model microsimulations as compared with the PUF and the confidential administrative data.

Notes

  1. See https://pslmodels.org/.

Acknowledgments

The projects outlined in this paper relied on the analytical capability that was made possible in part by a grant from Arnold Ventures. The findings and conclusions are those of the authors and do not necessarily reflect positions or policies of the Internal Revenue Service, the Urban Institute, or its funders.

Author information

Corresponding author

Correspondence to Claire McKay Bowen.

Appendix

Table 1. Number of duplicate records out of the possible 265,239 records.
Table 2. l-diversity results within each variable.
Table 3. l-diversity results across observations.
Fig. 1. Each dot represents a variable, such as wages and salaries or interest income. The diagonal line represents equivalence; dots off the line indicate variables whose weighted means or standard deviations differ between the SynPUF and modINSOLE.

Fig. 2. Density of pairwise correlation differences.
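
As a rough illustration of the quantity plotted in Fig. 2, the sketch below computes, for every pair of variables, the correlation in a synthetic file minus the correlation in a confidential file. It is a minimal sketch under stated assumptions, not the authors' procedure: it uses unweighted Pearson correlations, and the variable names, toy values, and function names (`pearson`, `pairwise_correlation_differences`) are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Unweighted Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlation_differences(confidential, synthetic):
    """Map each variable pair to corr(synthetic) - corr(confidential).

    Both arguments are dicts of variable name -> list of values and are
    assumed to share the same variable names.
    """
    diffs = {}
    for a, b in combinations(sorted(confidential), 2):
        diffs[(a, b)] = (
            pearson(synthetic[a], synthetic[b])
            - pearson(confidential[a], confidential[b])
        )
    return diffs


# Toy data, invented purely for illustration.
confidential = {
    "wages": [10.0, 20.0, 30.0, 40.0],
    "interest": [1.0, 2.0, 2.5, 4.0],
    "dividends": [0.0, 3.0, 1.0, 2.0],
}
synthetic = {
    "wages": [12.0, 18.0, 33.0, 39.0],
    "interest": [1.2, 1.9, 2.7, 3.8],
    "dividends": [0.5, 2.5, 1.5, 1.0],
}

diffs = pairwise_correlation_differences(confidential, synthetic)
for pair, diff in diffs.items():
    print(pair, round(diff, 3))
```

Differences concentrated near zero would suggest that the synthetic file preserves the pairwise linear relationships of the confidential file; a density plot of these values gives the kind of summary the caption describes.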

Fig. 3. Tax microsimulation results from the confidential data, PUF, and SynPUF, which are grouped in that order for these plots.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Bowen, C.M. et al. (2022). Synthetic Individual Income Tax Data: Methodology, Utility, and Privacy Implications. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_14

  • DOI: https://doi.org/10.1007/978-3-031-13945-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13944-4

  • Online ISBN: 978-3-031-13945-1

  • eBook Packages: Computer Science, Computer Science (R0)
