A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications

  • Conference paper
Privacy in Statistical Databases (PSD 2020)

Abstract

US government agencies possess data that could be invaluable for evaluating public policy, but these data often may not be released publicly due to disclosure concerns. For instance, the Statistics of Income division (SOI) of the Internal Revenue Service releases an annual public use file of individual income tax returns that is invaluable to tax analysts in government agencies, nonprofit research organizations, and the private sector. However, SOI has taken increasingly aggressive measures to protect the data in the face of growing disclosure risks, such as a data intruder matching the anonymized public data with other public information available in nontax databases. In this paper, we describe our approach to generating a fully synthetic representation of the income tax data by using sequential Classification and Regression Trees and kernel density smoothing. This synthetic data file represents previously unreleased information useful for tax policy modeling. We also tested and evaluated the tradeoffs between data utility and disclosure risks of different parameterizations using a variety of validation metrics. The resulting synthetic data set has high utility, particularly for summary statistics and microsimulation, and low disclosure risk.

Acknowledgments

We are grateful to Joe Ansaldi, Don Boyd, Jim Cilke, John Czajka, Rick Evans, Dan Feenberg, Barry Johnson, Julia Lane, Graham MacDonald, Shannon Mok, Jim Nunns, James Pearce, Kevin Pierce, Alan Plumley, Daniel Silva-Inclan, Michael Strudler, Lars Vilhuber, Mike Weber, and Doug Wissoker for helpful comments and discussions.

This research is supported by a grant from Arnold Ventures. The findings and conclusions are those of the authors and do not necessarily reflect positions or policies of the Tax Policy Center or its funders.

Author information

Corresponding author

Correspondence to Claire McKay Bowen.

Appendix

The appendix contains the supplementary materials to accompany the paper “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications” with additional results from the utility evaluation.

Figures 2 and 3 show the means and standard deviations, respectively, from the original and synthetic Supplemental PUF data for age and all 17 tax variables. Figure 4 displays, for all 17 tax variables, the correlation in the synthetic data minus the correlation in the original data.

Fig. 2. Means from the original and synthetic Supplemental PUF data. Age is plotted on the x-axis scale but is not a dollar amount.

Fig. 3. Standard deviations from the original and synthetic Supplemental PUF data. Age is plotted on the x-axis scale but is not a dollar amount.

Fig. 4. Difference in correlation (correlation of the synthetic minus the correlation of the original Supplemental PUF data).
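
The comparisons summarized in Figures 2 to 4 can be illustrated with a minimal sketch, assuming each data set is held as a mapping from variable name to a list of values. The function names, the variable names `age` and `wages`, and the toy values are hypothetical and are not taken from the paper or from the Supplemental PUF; this is not the authors' evaluation code.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def summary_comparison(original, synthetic):
    """Per variable: (mean_orig, mean_syn, sd_orig, sd_syn)."""
    return {
        name: (
            mean(original[name]),
            mean(synthetic[name]),
            stdev(original[name]),
            stdev(synthetic[name]),
        )
        for name in original
    }


def correlation_difference(original, synthetic):
    """Per variable pair: synthetic correlation minus original correlation."""
    names = list(original)
    diffs = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            diffs[(a, b)] = (
                pearson(synthetic[a], synthetic[b])
                - pearson(original[a], original[b])
            )
    return diffs


# Hypothetical toy values, for illustration only.
original = {
    "age": [25, 40, 33, 58],
    "wages": [12000.0, 30000.0, 21000.0, 9000.0],
}
synthetic = {
    "age": [27, 38, 35, 55],
    "wages": [11000.0, 28000.0, 23000.0, 10000.0],
}

summaries = summary_comparison(original, synthetic)
corr_diffs = correlation_difference(original, synthetic)
```

Under this sketch, a correlation difference near zero for a pair of variables means the synthetic data reproduce that pairwise linear relationship closely. Comparing a data set with itself returns a difference of zero for every pair.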

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Bowen, C.M. et al. (2020). A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_18

  • DOI: https://doi.org/10.1007/978-3-030-57521-2_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57520-5

  • Online ISBN: 978-3-030-57521-2

  • eBook Packages: Computer Science, Computer Science (R0)
