Multivariate Top-Coding for Statistical Disclosure Limitation

Oganian, Anna; Iacob, Ionut; Lesaja, Goran

doi:10.1007/978-3-030-57521-2_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

One of the most challenging problems for national statistical agencies is how to publicly release microdata sets with a large number of attributes while keeping the disclosure risk of data subjects' sensitive information under control. When statistical agencies alter microdata in order to limit the disclosure risk, they need to take into account relationships between the variables to produce a good quality public data set. Hence, Statistical Disclosure Limitation (SDL) methods should not be univariate (treating each variable independently of others), but preferably multivariate, that is, handling several variables at the same time. Statistical agencies are often concerned about disclosure risk associated with the extreme values of numerical variables. Thus, such observations are often top- or bottom-coded in the public use files. Top-coding replaces extreme observations of a numerical variable with a threshold, for example, the 99th percentile of that variable. Bottom-coding is defined similarly but applies to the values in the lower tail of the distribution. We argue that a univariate form of top/bottom-coding may not offer adequate protection for some subpopulations that differ, in terms of a top-coded variable, from other subpopulations or from the whole population. In this paper, we propose a multivariate form of top-coding: the variables are first clustered into groups according to some metric of closeness between them, and the rules for the multivariate top-codes are then formed using techniques of Association Rule Mining within the clusters of variables obtained in the previous step. Bottom-coding procedures can be defined in a similar way. We illustrate our method on a genuine multivariate data set of realistic size.
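As an illustration of the univariate top-coding described in the abstract, the following minimal Python sketch replaces values above the 99th percentile with that percentile. It is not the authors' implementation: the function names `percentile` and `top_code`, the nearest-rank percentile definition, and the toy income values are all assumptions made for this example. The last lines show one possible reading of the concern about subpopulations: a value that is extreme within a low-valued subgroup can still fall below the global threshold and so remain unaltered.

```python
def percentile(values, q):
    """Nearest-rank percentile (an assumed definition; others exist)."""
    ordered = sorted(values)
    # Integer ceiling of q * n / 100, clamped to at least rank 1.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def top_code(values, q=99):
    """Replace every value above the q-th percentile with that percentile."""
    threshold = percentile(values, q)
    return [min(v, threshold) for v in values], threshold


# Hypothetical toy data: 198 ordinary incomes and two extreme ones.
incomes = [30_000 + 100 * i for i in range(198)] + [500_000, 900_000]
coded, threshold = top_code(incomes, q=99)
altered = sum(1 for before, after in zip(incomes, coded) if before != after)

# Hypothetical subgroup with low values overall: 45_000 is extreme within
# the subgroup but lies below the global threshold, so it is left unchanged.
subgroup = [12_000, 12_500, 13_000, 13_200, 45_000]
subgroup_coded = [min(v, threshold) for v in subgroup]
```

In this toy setting only the two largest incomes are altered, while the subgroup passes through untouched, which is the kind of situation a rule conditioned on several variables is meant to address.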

Acknowledgments

The findings and conclusions in this paper are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. The first author would like to thank Ellen Galantucci from the Bureau of Labor Statistics for the helpful discussion on the content of Sect. 3. We would also like to thank John Pleis from the National Center for Health Statistics for the careful review of the paper.

Author information

Authors and Affiliations

National Center for Health Statistics, 3311 Toledo Road, Hyattsville, MD, 20782, USA
Anna Oganian
Department of Mathematical Sciences, Georgia Southern University, P.O. Box 8093, Statesboro, GA, 30460, USA
Ionut Iacob & Goran Lesaja
Mathematics Department, United States Naval Academy, 121 Blake Road, Annapolis, MD, 21402, USA
Goran Lesaja

Corresponding author

Correspondence to Anna Oganian.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
University of Oklahoma, Norman, OK, USA
Krishnamurty Muralidhar

A Appendix. Variables in the Census Data Set Mentioned in the Paper

Class - Class of worker. Categories: 0 N/a, Unemployed who never worked. 1 Employee of a private for profit company. 2 Employee of a private not for profit company. 3 Local government employee. City, county, etc. 4 State government employee. 5 Federal government employee. 6 Self employed in own not incorporated business. 7 Self employed in own incorporated business. 8 Working without pay in family business or farm. 9 Unemployed, last worked in 1984 or earlier.

IndustryClass - Industry class. Categories: 1 Agriculture. 2 Mining. 3 Manufacturing. 4 Transportation. 5 Wholesale trade. 6 Retail trade. 7 Finance. 8 Business. 9 Personal services. 10 Entertainment. 11 Professional. 12 Public administration.

Occupclass - Occupation class. Categories: 1 Managerial. 2 Professional. 3 Technical. 4 Service. 5 Farming. 6 Precision. 7 Operators. 8 Military.

Relat1 - Relationship to the householder. Categories: 0 Householder. 1 Husband/wife. 2 Son/daughter. 3 Stepson/stepdaughter. 4 Brother/sister. 5 Father/mother. 6 Grandchild. 7 Other relative. 8 Roomer/boarder/foster child. 9 Housemate/roommate. 10 Unmarried partner. 11 Other non related. 12 Institutionalized person. 13 Other person in group quarters.

Disable1 - Work limitation. Categories: 0 N/a. 1 Yes, Limited in kind or amount of work. 2 No, not Limited.

Rlabor - Employment status. Categories: 0 N/a. 1 Civilian employee, at work. 2 Civilian employee, with a job but not at work. 3 Unemployed. 4 Armed forces, at work. 5 Armed forces, with a job but not at work. 6 Not in labor force.

Hour89 - Usual hours worked per week the year before the interview. This is a numerical variable with range from 0 to 99.

Week89 - Weeks worked the year before the interview. This is a numerical variable with range from 0 to 52.

Yearsch - Educational attainment. Categories: 0 N/a. 1 No school completed. 2 Nursery school. 3 Kindergarten. 4 1st, 2nd, 3rd, or 4th grade. 5 5th, 6th, 7th, or 8th grade. 6 9th grade. 7 10th grade. 8 11th grade. 9 12th grade, No diploma. 10 High school graduate, diploma or GED. 11 Some College, But no degree. 12 Associate degree in College, Occupational. 13 Associate degree in College, Academic Program. 14 Bachelors degree. 15 Masters degree. 16 Professional degree. 17 Doctorate degree.
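The coded variables listed in this appendix can be handled in software as a small codebook. The sketch below is only one possible representation: the names `CODEBOOK`, `NUMERIC_RANGES`, and `decode`, and the sample record, are hypothetical. The category labels and numeric ranges are copied from the descriptions of Disable1, Rlabor, Hour89, and Week89 above.

```python
# Category labels taken from the appendix; the dictionary layout is an assumption.
CODEBOOK = {
    "Disable1": {
        0: "N/a",
        1: "Yes, limited in kind or amount of work",
        2: "No, not limited",
    },
    "Rlabor": {
        0: "N/a",
        1: "Civilian employee, at work",
        2: "Civilian employee, with a job but not at work",
        3: "Unemployed",
        4: "Armed forces, at work",
        5: "Armed forces, with a job but not at work",
        6: "Not in labor force",
    },
}

# Numeric variables and their documented ranges (inclusive).
NUMERIC_RANGES = {"Hour89": (0, 99), "Week89": (0, 52)}


def decode(record):
    """Map category codes to labels and check numeric values against ranges."""
    decoded = {}
    for name, value in record.items():
        if name in CODEBOOK:
            # Raises KeyError if the code is not listed for this variable.
            decoded[name] = CODEBOOK[name][value]
        elif name in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside [{low}, {high}]")
            decoded[name] = value
        else:
            decoded[name] = value  # Unknown variables pass through unchanged.
    return decoded


# Hypothetical record, for illustration only.
example = decode({"Rlabor": 1, "Disable1": 2, "Hour89": 40, "Week89": 52})
```

Keeping labels and ranges in plain dictionaries makes it straightforward to extend the codebook with the remaining variables (Class, IndustryClass, Occupclass, Relat1, Yearsch) in the same way.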

Cite this paper

Oganian, A., Iacob, I., Lesaja, G. (2020). Multivariate Top-Coding for Statistical Disclosure Limitation. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_10

DOI: https://doi.org/10.1007/978-3-030-57521-2_10
Published: 16 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)
