Disclosure Avoidance in the Census Bureau’s 2010 Demonstration Data Product

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Abstract

Producing accurate, usable data and protecting respondent privacy are dual mandates of the US Census Bureau. In 2019, the Census Bureau announced it would use a new disclosure avoidance technique, based on differential privacy, for the 2020 Decennial Census of Population and Housing [19]. Instead of suppressing data or swapping sensitive records, differentially private methods inject noise into counts to protect privacy. Unfortunately, noise injection may also make the data less useful and accurate. This paper describes the differentially private Disclosure Avoidance System (DAS) used to prepare the 2010 Demonstration Data Product (DDP). It describes the policy decisions that underlie the DAS and how the DAS uses those policy decisions to produce differentially private data. Finally, it discusses usability and accuracy issues in the DDP, with a focus on occupied housing unit counts. Occupied housing unit counts in the DDP differed greatly from those in 2010 Summary File 1, and the paper explains possible sources of the differences.

Supported by the Minnesota Population Center (R24 HD041023), funded through grants from the Eunice Kennedy Shriver National Institute for Child Health and Human Development.

Notes

  1.

    Space constraints prevent us from a complete discussion of the Bureau’s disclosure avoidance techniques. Interested readers are directed to McKenna [24, 25].

  2.

    The database reconstruction theorem states that respondent privacy is compromised when too many accurate statistics are published from the confidential data. For the 2010 decennial census, more than 150 billion statistics were published [23].

  3.

    The Bureau reconstructed microdata from a set of 2010 decennial tables. Tables P1, P6, P7, P9, P11, P12, P12A-I, and P14 for census blocks and PCT12A-N for census tracts were used in the reconstruction [23].

  4.

    The Bureau linked the two datasets by exact age and by age plus or minus one year [23].

  5.

    Readers interested in learning more about differential privacy are directed to Wood et al. [32] and Reiter [27]. These papers provide a relatively non-technical introduction to the topic. A critique of differential privacy can be found in Bambauer et al. [5].

  6.

    The Census Bureau executed a reconstruction and re-identification attack on the 2010 Demonstration Data Product, which was generated from a differentially private algorithm. Approximately 5% of the reconstructed microdata records were successfully matched to confidential data. The 5% re-identification rate represents an improvement over the 17% re-identified from the 2010 decennial census data, but it still represents approximately 15 million census respondents [21].

  7.

    The 2010 Demonstration Data Product was the Census Bureau’s third dataset produced by the Disclosure Avoidance System (DAS). The DAS consists of the Bureau’s differentially private algorithm and the post-processing routines required to enforce constraints. The first dataset contained tabulations from the 2018 Census Test enumeration phase, carried out in Providence County, Rhode Island. The second dataset consists of multiple runs of the DAS over the 1940 complete-count census microdata from IPUMS. Details are available in [3, 22].

  8.

    The Demographic and Housing Characteristics dataset is the replacement for Summary File 1.

  9.

    All decennial census products, except for Congressional apportionment counts, are derived from the Census Edited File (CEF). The CEF is produced through a series of imputations and allocations that fill in missing data from individual census returns and resolve inconsistencies. Readers interested in a more detailed discussion of the CEF production are directed to pages 10–11 of boyd [6].

  10.

    The National Academies of Sciences, Engineering, and Medicine’s Committee on National Statistics (CNStat) hosted a 2-day workshop on December 11–12, 2019. Census Bureau staff members presented details of the algorithm used to create the DDP. Census data users presented results from analyses that compared the 2010 DDP with 2010 Summary File 1 and PL94-171 data products. Privacy experts discussed issues surrounding the decennial census and potential harms of re-identification. Videos and slides from the workshop are available at https://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_196518.

  11.

    Technically, \(\epsilon\) must be greater than 0. If \(\epsilon\) were zero, no data would be published.

  12.

    The census tract group is not a standard unit in the Census Bureau’s geographic hierarchy. It was created specifically for the DAS to control the number of child units for each county. The census tract group consists of all census tracts with the same first four digits of their code (e.g., tract group 1001 consists of tracts 1001.01 and 1001.02). The DDP does not include data for tract groups.

  13.

    At the time the DAS for the 2010 Demonstration Data Product was designed, the Census Bureau assumed the citizenship question would be included on the 2020 Decennial Census questionnaire. Even though the US Supreme Court ruled in favor of the plaintiffs and removed the question, the Bureau did not have time to remove the citizenship variable from the DAS. No actual citizenship data was used to create the 2010 DDP; instead, the Bureau imputed citizenship status for records in the CEF [10].

  14.

    The geographic level associated with an invariant is the lowest level at which the invariant holds. All geographic levels composed of the lowest level will also be invariant.

  15.

    Sensitivity is the maximum amount by which a query's result can change when a single modification is made to the database. Histogram queries have a sensitivity of 2: if we increase the count in one cell by 1, we must decrease the count in another cell by 1.

  16.

    Two types of distributions, the two-tailed geometric and the Laplace, are typically used to achieve differential privacy. The two-tailed geometric distribution is used when integers are required, and the Laplace distribution is used when real numbers are required. Source code for the 2010 DDP includes functions for both types of distributions [11].

  17.

    The optimization problem is actually solved in two stages: the first enforces non-negativity and optimizes over the set of queries, and the second produces an integer-valued detailed histogram.

  18.

    The Census Bureau fielded so many questions about occupancy rates that it added a question to its FAQ [7]. The answer states that the Bureau would look into the issue and post answers or updates. As of 2020-05-22, no answers or updates had been posted.

  19.

    Readers interested in learning more about the discrepancy should watch Beth Jarosz’ presentation at the December 2019 CNStat workshop on the 2010 Demonstration Data Product [20].
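
Notes 15 and 16 can be illustrated with a minimal Python sketch. It is not the DAS source code [11]: the function names, the three-cell toy histogram, and the value of \(\epsilon\) are all hypothetical, and the samplers rely on the standard facts that the difference of two independent exponential draws is Laplace-distributed and the difference of two independent geometric draws is two-tailed geometric.

```python
import math
import random
from collections import Counter


def histogram(records, cells):
    """Count how many records fall into each cell."""
    counts = Counter(records)
    return [counts[cell] for cell in cells]


def l1_distance(h1, h2):
    """Sum of absolute cell-by-cell differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def laplace_noise(scale, rng):
    """Real-valued noise: the difference of two i.i.d. exponential draws
    with mean `scale` follows a Laplace(0, scale) distribution."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def geometric_draw(scale, rng):
    """Integer draw on {0, 1, 2, ...} with P(X >= k) = exp(-k / scale)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.floor(-scale * math.log(u)))


def two_tailed_geometric_noise(scale, rng):
    """Integer noise: the difference of two i.i.d. geometric draws has
    probability proportional to exp(-|k| / scale)."""
    return geometric_draw(scale, rng) - geometric_draw(scale, rng)


# Toy data (hypothetical): modifying one record moves it between cells.
cells = ["cell_a", "cell_b", "cell_c"]
original = ["cell_a", "cell_a", "cell_b", "cell_c"]
modified = ["cell_a", "cell_b", "cell_b", "cell_c"]  # one record recoded

true_counts = histogram(original, cells)
# One cell loses 1 and another gains 1, so the histograms differ by 2.
sensitivity = l1_distance(true_counts, histogram(modified, cells))

epsilon = 0.5  # hypothetical privacy-loss budget for this single query
scale = sensitivity / epsilon
rng = random.Random(2010)

noisy_integer_counts = [c + two_tailed_geometric_noise(scale, rng) for c in true_counts]
noisy_real_counts = [c + laplace_noise(scale, rng) for c in true_counts]

print(sensitivity)
print(noisy_integer_counts)
print(noisy_real_counts)
```

Noisy counts produced this way can be negative, which is one reason the optimization stage described in note 17 is needed to enforce non-negativity.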

References

  1. Abowd, J.: Disclosure avoidance for block level data and protection of confidentiality in public tabulations, December 2018. https://www2.census.gov/cac/sac/meetings/2018-12/abowd-disclosure-avoidance.pdf

  2. Abowd, J.: Protecting the Confidentiality of America’s Statistics: Adopting Modern Disclosure Avoidance Methods at the Census Bureau (2018). https://www.census.gov/newsroom/blogs/research-matters/2018/08/protecting_the_confi.html

  3. Abowd, J., Garfinkel, S.: Disclosure Avoidance and the 2018 Census Test: Release of the Source Code (2019). https://www.census.gov/newsroom/blogs/research-matters/2019/06/disclosure_avoidance.html

  4. Akee, R.: Population counts on American Indian Reservations and Alaska Native Villages, with and without the application of differential privacy. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019

  5. Bambauer, J., Muralidhar, K., Sarathy, R.: Fool’s gold: an illustrated critique of differential privacy. Vanderbilt J. Entertain. Technol. Law 16, 55 (2014)

  6. Boyd, D.: Balancing data utility and confidentiality in the 2020 US census. Technical report, Data and Society, New York, NY, December 2019

  7. Census Bureau: Frequently Asked Questions for the 2010 Demonstration Data Products. https://www.census.gov/programs-surveys/decennial-census/2020-census/planning-management/2020-census-data-products/2010-demonstration-data-products/faqs.html

  8. Census Bureau: Standard hierarchy of census geographic entities. Technical report, US Census Bureau, Washington DC, July 2010

  9. Census Bureau: 2010 Demonstration Data Products (2019). https://www.census.gov/programs-surveys/decennial-census/2020-census/planning-management/2020-census-data-products/2010-demonstration-data-products.html

  10. Census Bureau: 2010 demonstration P.L. 94–171 redistricting summary file and demographic and housing demonstration file: technical documentation. Technical report, US Department of Commerce, Washington, DC, October 2019

  11. Census Bureau: 2020 Census 2010 Demonstration Data Products Disclosure Avoidance System. US Census Bureau (2019)

  12. Census Bureau: 2020 census 2010 demonstration data products disclosure avoidance system: design specification, version 1.4. Technical report, US Census Bureau, Washington DC (2019)

  13. Cormode, G.: Building blocks of privacy: differentially private mechanisms. Technical report, Rutgers University, New Brunswick, NJ (n.d.)

  14. Dinur, I., Nissim, K.: Revealing information while preserving privacy. In: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS 2003, pp. 202–210. ACM, New York (2003). https://doi.org/10.1145/773153.773173

  15. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14

  16. Fontenot Jr., A.: 2010 demonstration data products - design parameters and global privacy-loss budget. Technical report 2019.25, US Census Bureau, Washington, DC, October 2019

  17. Garfinkel, S., Abowd, J.M., Martindale, C.: Understanding database reconstruction attacks on public data. ACM Queue 16(5), 28–53 (2018)

  18. Garfinkel, S.L., Abowd, J.M., Powazek, S.: Issues encountered deploying differential privacy. In: Proceedings of the 2018 Workshop on Privacy in the Electronic Society. WPES 2018, pp. 133–137. ACM, New York (2018). https://doi.org/10.1145/3267323.3268949

  19. Jarmin, R.: Census Bureau Adopts Cutting Edge Privacy Protections for 2020 Census (2019). https://www.census.gov/newsroom/blogs/random-samplings/2019/02/census_bureau_adopts.html

  20. Jarosz, B.: Importance of decennial census for regional planning in California. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019

  21. Leclerc, P.: The 2020 decennial census topdown disclosure limitation algorithm: a report on the current state of the privacy loss-accuracy trade-off. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington DC, December 2019

  22. Leclerc, P.: Guide to the census 2018 end-to-end test disclosure avoidance algorithm and implementation. Technical report, US Census Bureau, Washington DC, July 2019

  23. Leclerc, P.: Reconstruction of person level data from data presented in multiple tables. In: Challenges and New Approaches for Protecting Privacy in Federal Statistical Programs: A Workshop, Washington, DC, June 2019

  24. McKenna, L.: Disclosure avoidance techniques used for the 1970 through 2010 decennial censuses of population and housing. Technical report 18–47, US Census Bureau, Washington, DC (2018)

  25. McKenna, L.: Disclosure avoidance techniques used for the 1960 through 2010 decennial censuses of population and housing public use microdata samples. Technical report, US Census Bureau, Washington, DC (2019)

  26. Nagle, N.: Implications for municipalities and school enrollment statistics. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019

  27. Reiter, J.P.: Differential privacy and federal data releases. Ann. Rev. Stat. Appl. 6(1), 85–101 (2019). https://doi.org/10.1146/annurev-statistics-030718-105142

  28. Sandberg, E.: Privatized data for Alaska communities. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019

  29. Santos-Lozada, A.: Differential privacy and mortality rates in the United States. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019

  30. Santos-Lozada, A.R., Howard, J.T., Verdery, A.M.: How differential privacy will affect our understanding of health disparities in the United States. Proc. Natl. Acad. Sci. (2020). https://doi.org/10.1073/pnas.2003714117

  31. Spielman, S., Van Riper, D.: Geographic review of differentially private demonstration data. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019

  32. Wood, A., et al.: Differential privacy: a primer for a non-technical audience. Vanderbilt J. Entertain. Technol. Law 21(1), 209–276 (2018)

Author information

Corresponding author

Correspondence to David Van Riper.

Appendices

A Privacy Loss Budget Allocations

Table 2 lists the 7 person and 6 household queries that received direct allocations of the privacy loss budget. The allocations are shown in the \(PLB_{frac}\) column. The bold rows are the queries with the largest PLB allocation.

Table 2. Privacy loss budget allocations and scale parameters for 2010 DDP queries.

The \(Scale_{nation}\) and \(Scale_{county}\) columns list the scale factors used to generate the statistical distributions from which noise injection values are drawn. The \(Scale_{nation}\) values are used for the nation and state histograms, and the \(Scale_{county}\) values are used for the county, tract group, tract, block group, and block histograms.
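
As a rough illustration of why the two scale columns differ, the sketch below assumes the textbook relationship in which a noise scale equals the query sensitivity divided by the portion of the privacy loss budget spent on that query at that geographic level. The function name `noise_scale` and every numeric value are illustrative placeholders, not figures taken from Table 2, and the DAS may compute its scale parameters differently.

```python
def noise_scale(sensitivity, epsilon, geolevel_share, query_share):
    """Scale of the noise distribution for one query at one geographic level.

    Assumes scale = sensitivity / (budget spent on this query at this level).
    """
    budget = epsilon * geolevel_share * query_share
    if budget <= 0:
        raise ValueError("the allocated privacy loss budget must be positive")
    return sensitivity / budget


# Illustrative placeholder values only.
epsilon = 4.0                  # hypothetical global budget
query_share = 0.25             # hypothetical PLB fraction for one query
nation_state_share = 0.20      # hypothetical share per nation/state level
county_and_below_share = 0.12  # hypothetical share per lower level

scale_nation = noise_scale(2, epsilon, nation_state_share, query_share)
scale_county = noise_scale(2, epsilon, county_and_below_share, query_share)

print(scale_nation, scale_county)
```

Under these assumptions, a level that receives a smaller share of the budget gets a larger scale, and therefore noisier counts.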

The \(Hist_{size}\) column lists the number of cells in each query, that is, the number of cells in each row of the histogram (one row per geographic unit). The value is the product of the number of categories for each variable in the query. For example, the Sex * Age (64 year bins) query has two categories for sex and two categories for age, giving a histogram size of 4. Category counts for each variable are listed in frequently asked question 11 in [7].
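
The histogram size calculation can be sketched in a few lines of Python. The function name and dictionary keys are hypothetical; the category counts come from the Sex * Age (64 year bins) example above.

```python
from math import prod


def histogram_size(category_counts):
    """Multiply together the number of categories of each variable in a query."""
    return prod(category_counts.values())


# Sex * Age (64 year bins): two sex categories and two age categories.
sex_by_age64 = {"sex": 2, "age_64_year_bins": 2}
print(histogram_size(sex_by_age64))  # prints 4
```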

The variable names in the household queries are described in Table 3.

Table 3. Variable names and descriptions.

B Top-Down Algorithm flow diagram

Figure 3 depicts the flow of data through the noise injection and optimization steps of the Census Bureau’s Top-Down Algorithm.

Fig. 3. Census DAS optimization flow diagram.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Van Riper, D., Kugler, T., Ruggles, S. (2020). Disclosure Avoidance in the Census Bureau’s 2010 Demonstration Data Product. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_25

  • DOI: https://doi.org/10.1007/978-3-030-57521-2_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57520-5

  • Online ISBN: 978-3-030-57521-2
