Skip to main content

Multivariate Top-Coding for Statistical Disclosure Limitation

  • Conference paper
  • First Online:
Privacy in Statistical Databases (PSD 2020)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 12276))

Included in the following conference series:

  • 799 Accesses

Abstract

One of the most challenging problems for national statistical agencies is how to release to the public microdata sets with a large number of attributes while keeping the disclosure risk of sensitive information of data subjects under control. When statistical agencies alter microdata in order to limit the disclosure risk, they need to take into account relationships between the variables to produce a good quality public data set. Hence, Statistical Disclosure Limitation (SDL) methods should not be univariate (treating each variable independently of others), but preferably multivariate, that is, handling several variables at the same time. Statistical agencies are often concerned about disclosure risk associated with the extreme values of numerical variables. Thus, such observations are often top or bottom-coded in the public use files. Top-coding consists of the substitution of extreme observations of the numerical variable by a threshold, for example, by the 99th percentile of the corresponding variable. Bottom coding is defined similarly but applies to the values in the lower tail of the distribution. We argue that a univariate form of top/bottom-coding may not offer adequate protection for some subpopulations which are different in terms of a top-coded variable from other subpopulations or the whole population. In this paper, we propose a multivariate form of top-coding based on clustering the variables into groups according to some metric of closeness between the variables and then forming the rules for the multivariate top-codes using techniques of Association Rule Mining within the clusters of variables obtained on the previous step. Bottom-coding procedures can be defined in a similar way. We illustrate our method on a genuine multivariate data set of realistic size.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. ACS: American community survey. United States Census Bureau. https://www.census.gov/programs-surveys/acs

  2. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data - SIGMOD 1993, pp. 207–216, June 1993

    Google Scholar 

  3. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, pp. 487–499, September 1994

    Google Scholar 

  4. BRFSS: Behavioral risk factor surveillance system. Centers for Disease Control and Prevention (CDC). https://www.cdc.gov/brfss/index.html

  5. Census: US census (1990) data set. UCI Machine Learning Repository (2017). https://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29

  6. Chavent, M., Kuentz-Simonet, V., Liquet, B., Saracco, J.: ClustOfVar: an R package for the clustering of variables. J. Stat. Softw. 50(i13), 1–16 (2012)

    Google Scholar 

  7. CPS: Current population survey. United States Census Bureau. https://www.census.gov/programs-surveys/cps.html

  8. Dheeru, D., Karra Taniskidou, E.: UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences (2017). http://archive.ics.uci.edu/ml

  9. Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T.: Mining optimized association rules for numeric attributes. In: Proceedings of the 15th ACM SIGACTSIGMOD - SIGART PODS96, pp. 182–191. ACM Press (1996)

    Google Scholar 

  10. Hájek, P., Havránek, T.: Mechanizing Hypothesis Formation: Mathematical Foundations for a General Theory. Springer, Heidelberg (1978). https://doi.org/10.1007/978-3-642-66943-9

    Book  MATH  Google Scholar 

  11. Han, J.: Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD 2000, pp. 1–12 (2000)

    Google Scholar 

  12. Hundepool, A., et al.: Handbook on Statistical Disclosure Control (version 1.2). ESSNET, SDC project (2010). http://neon.vb.cbs.nl/casc

  13. Hundepool, A., et al.: Statistical Disclosure Control. Wiley, Hoboken (2012)

    Book  Google Scholar 

  14. NHANES: National Health and Nutrition Examination Survey. Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). https://www.cdc.gov/nchs/data/factsheets/factsheet_nhanes.htm

  15. NHIS: National Health Interview Survey. Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). https://www.cdc.gov/nchs/nhis/index.htm

  16. Oganian, A., Iacob, I., Lesaja, G.: Grouping of variables to facilitate SDL methods in multivariate data sets. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 187–199. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_13

    Chapter  Google Scholar 

  17. Ross, M., Bateman, N.: Meet the low-wage workforce. Technical report, Brookings (2019)

    Google Scholar 

  18. Salleb-Aouissi, A., Vrain, C., Nortet, C., Xiangrong Kong, X., Vivek Rathod, V., Cassard, D.: QuantMiner for mining quantitative association rules. J. Mach. Learn. Res. 14(61), 3153–3157 (2013). http://jmlr.org/papers/v14/salleb-aouissi13a.html

  19. U.S. Department of Commerce Economics and Statistics Administration. BUREAU OF THE CENSUS: 1990 Census of Population and Housing. Public Use Microdata Samples. United States (1990)

    Google Scholar 

  20. Zaki, M.J.: Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 12(3), 372390 (2000)

    Article  Google Scholar 

Download references

Acknowledgments

The findings and conclusions in this paper are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. The first author would like to thank Ellen Galantucci from the Bureau of Labor Statistics for the helpful discussion on the content of Sect. 3. Also we would like to thank John Pleis from the National Center for Health Statistics for the careful review of the paper.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Anna Oganian .

Editor information

Editors and Affiliations

A Appendix. Variables in the Census Data Set Mentioned in the Paper

A Appendix. Variables in the Census Data Set Mentioned in the Paper

Class - Class of worker. Categories: 0 N/a, Unemployed who never worked. 1 Employee of a private for profit company. 2 Employee of a private not for profit company. 3 Local government employee. City, county, etc. 4 State government employee. 5 Federal government employee. 6 Self employed in own not incorporated business. 7 Self employed in own incorporated business. 8 Working without pay in family business or farm. 9 Unemployed, last worked in 1984 or earlier.

IndustryClass - Industry class. Categories: 1 Agriculture. 2 Mining. 3 Manufacturing. 4 Transportation. 5 Wholesale trade. 6 Retail trade. 7 Finance. 8 Business. 9 Personal services. 10 Entertainment. 11 Professional. 12 Public administration.

Occupclass - occupation class. Categories: 1 Managerial. 2 Professional. 3 Technical. 4 Service. 5 Farming. 6 Precision. 7 Operators. 8 Military.

Relat1 - Relationship to the householder. Categories: 0 Householder. 1 Husband/wife 2 Son/daughter 3 Stepson/stepdaughter 4 Brother/sister 5 Father/mother 6 Grandchild 7 Other relative 8 Roomer/boarder/foster child 9 Housemate/roommate 10 Unmarried partner 11 Other non related. 12 Institutionalized person. 13 Other person in group quarters.

Disable1 - Work limitation. Categories: 0 N/a. 1 Yes, Limited in kind or amount of work. 2 No, not Limited.

Rlabor - Employment status. Categories: 0 N/a 1 Civilian employee, at work. 2 Civilian employee, with a job but not at work. 3 Unemployed. 4 Armed forces, at work. 5 Armed forces, with a job but not at work. 6 Not in labor force.

Hour89 - Usual hours worked per week the year before the interview. This is a numerical variable with range from 0 to 99.

Week89 - Weeks worked the year before the interview. This is a numerical variable with range from 0 to 52.

Yearsch - educational attainment. Categories: 0 N/a. 1 No school completed. 2 Nursery school. 3 Kindergarten. 4 1st, 2nd, 3rd, or 4th grade. 5 5th, 6th, 7th, or 8th grade. 6 9th grade. 7 10th grade. 8 11th grade. 9 12th grade, No diploma. 10 High school graduate, diploma or GED. 11 Some College, But no degree. 12 Associate degree in College, Occupational. 13 Associate degree in College, Academic Program. 14 Bachelors degree. 15 Masters degree. 16 Professional degree. 17 Doctorate degree.

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Oganian, A., Iacob, I., Lesaja, G. (2020). Multivariate Top-Coding for Statistical Disclosure Limitation. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science(), vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_10

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-57521-2_10

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57520-5

  • Online ISBN: 978-3-030-57521-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics