Disclosure Avoidance in the Census Bureau’s 2010 Demonstration Data Product

Van Riper, David; Kugler, Tracy; Ruggles, Steven

doi:10.1007/978-3-030-57521-2_25

Conference paper
First Online: 16 September 2020

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Abstract

Producing accurate, usable data and protecting respondent privacy are dual mandates of the US Census Bureau. In 2019, the Census Bureau announced it would use a new disclosure avoidance technique, based on differential privacy, for the 2020 Decennial Census of Population and Housing [19]. Instead of suppressing data or swapping sensitive records, differentially private methods inject noise into counts to protect privacy. Unfortunately, noise injection may also make the data less useful and accurate. This paper describes the differentially private Disclosure Avoidance System (DAS) used to prepare the 2010 Demonstration Data Product (DDP). It describes the policy decisions that underlie the DAS and how the DAS uses those policy decisions to produce differentially private data. Finally, it discusses usability and accuracy issues in the DDP, with a focus on occupied housing unit counts. Occupied housing unit counts in the DDP differed greatly from those in 2010 Summary File 1, and the paper explains possible sources of the differences.

Supported by the Minnesota Population Center (R24 HD041023), funded through grants from the Eunice Kennedy Shriver National Institute for Child Health and Human Development.

Notes

1.
Space constraints prevent us from a complete discussion of the Bureau’s disclosure avoidance techniques. Interested readers are directed to McKenna [24, 25].
2.
The database reconstruction theorem states that respondent privacy is compromised when too many accurate statistics are published from the confidential data. For the 2010 decennial census, more than 150 billion statistics were published [23].
3.
The Bureau reconstructed microdata from a set of 2010 decennial tables. Tables P1, P6, P7, P9, P11, P12, P12A-I, and P14 for census blocks and PCT12A-N for census tracts were used in the reconstruction [23].
4.
The Bureau linked the two datasets by exact age and by age plus or minus one year [23].
5.
Readers interested in learning more about differential privacy are directed to Wood et al. [32] and Reiter [27]. These papers provide a relatively non-technical introduction to the topic. A critique of differential privacy can be found in Bambauer et al. [5].
6.
The Census Bureau executed a reconstruction and re-identification attack on the 2010 Demonstration Data Product, which was generated from a differentially private algorithm. Approximately 5% of the reconstructed microdata records were successfully matched to confidential data. The 5% re-identification rate represents an improvement over the 17% re-identified from the 2010 decennial census data, but it still represents approximately 15 million census respondents [21].
7.
The 2010 Demonstration Data Product was the Census Bureau’s third dataset produced by the Disclosure Avoidance System (DAS). The DAS consists of the Bureau’s differentially private algorithm and the post-processing routines required to enforce constraints. The first dataset contained tabulations from the 2018 Census Test enumeration phase, carried out in Providence County, Rhode Island. The second dataset consists of multiple runs of the DAS over the 1940 complete-count census microdata from IPUMS. Details are available in [3, 22].
8.
The Demographic and Housing Characteristics dataset is the replacement for Summary File 1.
9.
All decennial census products, except for Congressional apportionment counts, are derived from the Census Edited File (CEF). The CEF is produced through a series of imputations and allocations that fill in missing data from individual census returns and resolve inconsistencies. Readers interested in a more detailed discussion of the CEF production are directed to pages 10–11 of boyd [6].
10.
The National Academies of Sciences, Engineering, and Medicine’s Committee on National Statistics (CNStat) hosted a 2-day workshop on December 11–12, 2019. Census Bureau staff members presented details of the algorithm used to create the DDP. Census data users presented results from analyses that compared the 2010 DDP with 2010 Summary File 1 and PL94-171 data products. Privacy experts discussed issues surrounding the decennial census and potential harms of re-identification. Videos and slides from the workshop are available at https://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_196518.
11.
Technically, \(\epsilon\) must be greater than 0. If \(\epsilon\) were zero, then no data would be published.
12.
The census tract group is not a standard unit in the Census Bureau’s geographic hierarchy. It was created specifically for the DAS to control the number of child units for each county. The census tract group consists of all census tracts with the same first four digits of their code (e.g., tract group 1001 consists of tracts 1001.01 and 1001.02). The DDP does not include data for tract groups.
13.
At the time the DAS for the 2010 Demonstration Data Product was designed, the Census Bureau assumed the citizenship question would be included on the 2020 Decennial Census questionnaire. Even though the US Supreme Court ruled in favor of the plaintiffs and removed the question, the Bureau did not have time to remove the citizenship variable from the DAS. No actual citizenship data was used to create the 2010 DDP; instead, the Bureau imputed citizenship status for records in the CEF [10].
14.
The geographic level associated with an invariant is the lowest level at which the invariant holds. All geographic levels composed of the lowest level will also be invariant.
15.
Sensitivity is the maximum amount by which a query result changes if we make a single modification to the database. Histogram queries have a sensitivity of 2: if we increase the count in one cell by 1, we must decrease the count in another cell by 1.
16.
Two types of distributions, the two-tailed geometric and the Laplace, are typically used to achieve differential privacy. The two-tailed geometric distribution is used when integers are required, and the Laplace distribution is used when real numbers are required. Source code for the 2010 DDP includes functions for both types of distributions [11].
17.
The optimization problem is actually solved in two stages, one to enforce non-negativity and optimize over the set of queries and a second stage to produce an integer-valued detailed histogram.
18.
The Census Bureau fielded so many questions about occupancy rates that it added a question to its FAQ [7]. The answer mentions that the Bureau would look into the issue and post answers or updates. As of 2020-05-22, no answers or updates have been posted.
19.
Readers interested in learning more about the discrepancy should watch Beth Jarosz’ presentation at the December 2019 CNStat workshop on the 2010 Demonstration Data Product [20].
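
Notes 15 and 16 can be illustrated with a minimal Python sketch. It is not the DAS source code: the function names, the toy records, and the value of epsilon are hypothetical, and the scale is set with the textbook relation scale = sensitivity / epsilon, which is an assumption here rather than something stated in these notes.

```python
import math
import random
from collections import Counter


def l1_distance(hist_a, hist_b):
    """Sum of absolute cell-by-cell differences between two histograms."""
    cells = set(hist_a) | set(hist_b)
    return sum(abs(hist_a.get(c, 0) - hist_b.get(c, 0)) for c in cells)


def two_sided_geometric(scale, rng):
    """Integer noise: difference of two geometric draws with ratio exp(-1/scale)."""
    def geometric():
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        return math.floor(-math.log(u) * scale)
    return geometric() - geometric()


def laplace(scale, rng):
    """Real-valued noise: difference of two exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


# Toy database (hypothetical): one record per person, keyed by (sex, age group).
records = [("F", "under 18"), ("M", "18+"), ("M", "18+")]
before = Counter(records)

# A single modification: the first person's age group changes.
after = Counter([("F", "18+")] + records[1:])

# One cell drops by 1 and another rises by 1, so the histogram moves by 2.
sensitivity = l1_distance(before, after)

epsilon = 0.5                  # hypothetical privacy-loss budget for this query
scale = sensitivity / epsilon  # assumed textbook relation, not taken from the DAS

rng = random.Random(0)
cell = ("M", "18+")
noisy_integer_count = before[cell] + two_sided_geometric(scale, rng)
noisy_real_count = before[cell] + laplace(scale, rng)
```

The integer-valued draw keeps the noisy cell a whole number, which is why a two-tailed geometric distribution suits count data; the Laplace draw returns a real number. Either noisy count can be negative, which is one reason a post-processing step is needed to enforce non-negativity.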

References

1. Abowd, J.: Disclosure avoidance for block level data and protection of confidentiality in public tabulations, December 2018. https://www2.census.gov/cac/sac/meetings/2018-12/abowd-disclosure-avoidance.pdf
2. Abowd, J.: Protecting the Confidentiality of America’s Statistics: Adopting Modern Disclosure Avoidance Methods at the Census Bureau (2018). https://www.census.gov/newsroom/blogs/research-matters/2018/08/protecting_the_confi.html
3. Abowd, J., Garfinkel, S.: Disclosure Avoidance and the 2018 Census Test: Release of the Source Code (2019). https://www.census.gov/newsroom/blogs/research-matters/2019/06/disclosure_avoidance.html
4. Akee, R.: Population counts on American Indian Reservations and Alaska Native Villages, with and without the application of differential privacy. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019
5. Bambauer, J., Muralidhar, K., Sarathy, R.: Fool’s gold: an illustrated critique of differential privacy. Vanderbilt J. Entertain. Technol. Law 16, 55 (2014)
6. Boyd, D.: Balancing data utility and confidentiality in the 2020 US census. Technical report, Data and Society, New York, NY, December 2019
7. Census Bureau: Frequently Asked Questions for the 2010 Demonstration Data Products. https://www.census.gov/programs-surveys/decennial-census/2020-census/planning-management/2020-census-data-products/2010-demonstration-data-products/faqs.html
8. Census Bureau: Standard hierarchy of census geographic entities. Technical report, US Census Bureau, Washington DC, July 2010
9. Census Bureau: 2010 Demonstration Data Products (2019). https://www.census.gov/programs-surveys/decennial-census/2020-census/planning-management/2020-census-data-products/2010-demonstration-data-products.html
10. Census Bureau: 2010 demonstration P.L. 94–171 redistricting summary file and demographic and housing demonstration file: technical documentation. Technical report, US Department of Commerce, Washington, DC, October 2019
11. Census Bureau: 2020 Census 2010 Demonstration Data Products Disclosure Avoidance System. US Census Bureau (2019)
12. Census Bureau: 2020 census 2010 demonstration data products disclosure avoidance system: design specification, version 1.4. Technical report, US Census Bureau, Washington DC (2019)
13. Cormode, G.: Building blocks of privacy: differentially private mechanisms. Technical report, Rutgers University, Brunswick, NJ (nd)
14. Dinur, I., Nissim, K.: Revealing information while preserving privacy. In: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS 2003, pp. 202–210. ACM, New York (2003). https://doi.org/10.1145/773153.773173
15. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
16. Fontenot Jr., A.: 2010 demonstration data products - design parameters and global privacy-loss budget. Technical report 2019.25, US Census Bureau, Washington, DC, October 2019
17. Garfinkel, S., Abowd, J.M., Martindale, C.: Understanding database reconstruction attacks on public data. ACM Queue 16(5), 28–53 (2018)
18. Garfinkel, S.L., Abowd, J.M., Powazek, S.: Issues encountered deploying differential privacy. In: Proceedings of the 2018 Workshop on Privacy in the Electronic Society. WPES 2018, pp. 133–137. ACM, New York (2018). https://doi.org/10.1145/3267323.3268949
19. Jarmin, R.: Census Bureau Adopts Cutting Edge Privacy Protections for 2020 Census (2019). https://www.census.gov/newsroom/blogs/random-samplings/2019/02/census_bureau_adopts.html
20. Jarosz, B.: Importance of decennial census for regional planning in California. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019
21. Leclerc, P.: The 2020 decennial census topdown disclosure limitation algorithm: a report on the current state of the privacy loss-accuracy trade-off. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington DC, December 2019
22. Leclerc, P.: Guide to the census 2018 end-to-end test disclosure avoidance algorithm and implementation. Technical report, US Census Bureau, Washington DC, July 2019
23. Leclerc, P.: Reconstruction of person level data from data presented in multiple tables. In: Challenges and New Approaches for Protecting Privacy in Federal Statistical Programs: A Workshop, Washington, DC, June 2019
24. McKenna, L.: Disclosure avoidance techniques used for the 1970 through 2010 decennial censuses of population and housing. Technical report 18–47, US Census Bureau, Washington, DC (2018)
25. McKenna, L.: Disclosure avoidance techniques used for the 1960 through 2010 decennial censuses of population and housing public use microdata samples. Technical report, US Census Bureau, Washington, DC (2019)
26. Nagle, N.: Implications for municipalities and school enrollment statistics. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019
27. Reiter, J.P.: Differential privacy and federal data releases. Ann. Rev. Stat. Appl. 6(1), 85–101 (2019). https://doi.org/10.1146/annurev-statistics-030718-105142
28. Sandberg, E.: Privatized data for Alaska communities. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019
29. Santos-Lozada, A.: Differential privacy and mortality rates in the United States. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019
30. Santos-Lozada, A.R., Howard, J.T., Verdery, A.M.: How differential privacy will affect our understanding of health disparities in the United States. Proc. Natl. Acad. Sci. (2020). https://doi.org/10.1073/pnas.2003714117
31. Spielman, S., Van Riper, D.: Geographic review of differentially private demonstration data. In: Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations, Washington, DC, December 2019
32. Wood, A., et al.: Differential privacy: a primer for a non-technical audience. Vanderbilt J. Entertain. Technol. Law 21(1), 209–276 (2018)

Author information

Authors and Affiliations

Minnesota Population Center, University of Minnesota, Minneapolis, MN, 55455, USA
David Van Riper, Tracy Kugler & Steven Ruggles

Corresponding author

Correspondence to David Van Riper.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
University of Oklahoma, Norman, OK, USA
Krishnamurty Muralidhar

Appendices

A Privacy Loss Budget Allocations

Table 2 lists the 7 person and 6 household queries that received direct allocations of the privacy loss budget. The allocations are shown in the \(PLB_{frac}\) column. The bold rows are the queries with the largest PLB allocation.

Table 2. Privacy loss budget allocations and scale parameters for 2010 DDP queries.

The \(Scale_{nation}\) and \(Scale_{county}\) columns list the scale factors used to generate the statistical distributions from which noise injection values are drawn. The \(Scale_{nation}\) values are used for the nation and state histograms, and the \(Scale_{county}\) values are used for the county, tract group, tract, block group, and block histograms.
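
One way to see why a query needs two scale values is the sketch below. The appendix does not state how the scale factors are computed, so the formula is an assumption: it uses the textbook relation scale = sensitivity / epsilon, with the epsilon for a query taken as the total budget times a geographic level's share times the query's \(PLB_{frac}\). The function name and all numeric values are hypothetical and are not taken from Table 2.

```python
def scale_parameter(sensitivity, total_epsilon, geolevel_share, plb_frac):
    """Assumed relation: scale = sensitivity / (epsilon spent on this query)."""
    epsilon_for_query = total_epsilon * geolevel_share * plb_frac
    if epsilon_for_query <= 0:
        raise ValueError("the privacy-loss budget for a query must be positive")
    return sensitivity / epsilon_for_query


# Hypothetical inputs, chosen only to show the direction of the effect.
SENSITIVITY = 2           # histogram queries, per Note 15
TOTAL_EPSILON = 1.0       # hypothetical global budget
NATION_STATE_SHARE = 0.2  # hypothetical share for the nation and state levels
COUNTY_DOWN_SHARE = 0.1   # hypothetical share for county and lower levels
PLB_FRAC = 0.25           # hypothetical per-query allocation

scale_nation = scale_parameter(SENSITIVITY, TOTAL_EPSILON, NATION_STATE_SHARE, PLB_FRAC)
scale_county = scale_parameter(SENSITIVITY, TOTAL_EPSILON, COUNTY_DOWN_SHARE, PLB_FRAC)

# A smaller budget share gives a larger scale, and therefore noisier counts.
print(scale_nation < scale_county)
```

Under this assumption, any geographic level that receives a smaller share of the budget gets a larger scale factor for the same query, so the noise drawn for its histograms has a wider spread.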

The \(Hist_{size}\) column lists the number of cells in the particular query. This is the number of cells on each row of the histogram (i.e., for each geographic unit). The value is generated by multiplying together the number of categories for each variable in a query. For example, the Sex * Age (64 year bins) query has two categories for sex and two categories for age giving a histogram size of 4. Category counts for each variable are listed in frequently asked question 11 in [7].
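
The \(Hist_{size}\) arithmetic can be sketched in a few lines of Python. The function and dictionary keys are hypothetical names; the category counts for the example are the ones given in the paragraph above.

```python
import math


def histogram_size(category_counts):
    """Number of cells per geographic unit: the product of the category counts."""
    return math.prod(category_counts.values())


# Sex * Age (64 year bins): two sex categories and two age categories.
sex_by_age = {"sex": 2, "age_64_year_bins": 2}
print(histogram_size(sex_by_age))  # 4
```

Because the size is a product, adding a variable to a query multiplies the number of cells that receive noise in every geographic unit.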

The variable names in the household queries are described in Table 3.

Table 3. Variable names and descriptions.

B Top-Down Algorithm flow diagram

Figure 3 depicts the flow of data through the noise injection and optimization steps of the Census Bureau’s Top-Down Algorithm.

Cite this paper

Van Riper, D., Kugler, T., Ruggles, S. (2020). Disclosure Avoidance in the Census Bureau’s 2010 Demonstration Data Product. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_25

DOI: https://doi.org/10.1007/978-3-030-57521-2_25
Published: 16 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)
