Abstract
The exponential growth of collected, processed, and shared microdata has given rise to concerns about individuals’ privacy. As a result, laws and regulations have emerged to control what organisations do with microdata and how they protect it. Statistical Disclosure Control seeks to reduce the risk of confidential information disclosure by de-identifying the data. Such de-identification is achieved through privacy-preserving techniques (PPTs). However, de-identification usually results in loss of information, with a possible impact on data analysis precision and model predictive performance. The main goal is to protect individuals’ privacy while maintaining the interpretability of the data (i.e., its usefulness). Statistical Disclosure Control is an expanding area that needs further exploration, since no existing solution guarantees optimal privacy and utility. This survey focuses on all steps of the de-identification process. We present existing PPTs used in microdata de-identification, privacy measures suitable for several disclosure types, and information loss and predictive performance measures. We discuss the main challenges raised by privacy constraints, describe the main approaches to handle these obstacles, review the taxonomies of PPTs, provide a theoretical analysis of existing comparative studies, and raise multiple open issues.
- [188] . 2017. DataSynthesizer: Privacy-preserving synthetic datasets. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management. 1–5.Google ScholarDigital Library
- [189] . 2020. Flexible data anonymization using ARX—Current status and challenges ahead. Software: Practice and Experience 50, 7 (2020), 1277–1304.Google ScholarCross Ref
- [190] . 2016. The importance of context: Risk-based de-identification of biomedical data. Methods of Information in Medicine 55, 4 (2016), 347–355.Google ScholarCross Ref
- [191] . 2020. Data mining and analysis of scientific research data records on Covid-19 mortality, immunity, and vaccine development—In the first wave of the Covid-19 pandemic. Diabetes & Metabolic Syndrome: Clinical Research & Reviews 14, 5 (2020), 1121–1132.Google ScholarCross Ref
- [192] . 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 336 (1971), 846–850.Google ScholarDigital Library
- [193] . 2020. Reliability of supervised machine learning using synthetic data in health care: Model to preserve privacy for data sharing. JMIR Medical Informatics 8, 7 (2020), e18910.Google ScholarCross Ref
- [194] . 2005. Estimating risks of identification disclosure in microdata. Journal of the American Statistical Association 100, 472 (2005), 1103–1112.Google ScholarCross Ref
- [195] . 2005. Using CART to generate partially synthetic public use microdata. Journal of Official Statistics 21, 3 (2005), 441.Google Scholar
- [196] . 1979. Information Retrieval. Butterworth-Heinemann.Google ScholarDigital Library
- [197] . 2009. UK release practices for official microdata. Statistical Journal of the IAOS 26, 3, 4 (2009), 103–111.Google Scholar
- [198] . 2019. Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications 10, 1 (2019), 1–9.Google ScholarCross Ref
- [199] . 2018. Discerning suicide in drug intoxication deaths: Paucity and primacy of suicide notes and psychiatric history. PLoS One 13, 1 (2018), e0190200.Google ScholarCross Ref
- [200] . 2017. Efficient anonymization algorithms to prevent generalized losses and membership disclosure in microdata. American Journal of Data Mining and Knowledge Discovery 2, 2 (2017), 54–61.Google Scholar
- [201] . 2020. Differentially private synthetic data: Applied evaluations and enhancements. arXiv preprint arXiv:2011.05537 (2020).Google Scholar
- [202] . 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.Google ScholarDigital Library
- [203] . 1993. Discussion statistical disclosure limitation. Journal of Official Statistics 9, 2 (1993), 461.Google Scholar
- [204] . 2019. Towards a global data privacy standard. Florida Law Review 71 (2019), 365.Google Scholar
- [205] . 2019. Handbook on Statistical Disclosure Control for Outputs. Retrieved November 1, 2022 from https://ukdataservice.ac.uk/app/uploads/thf_datareport_aw_web.pdf.Google Scholar
- [206] . 2001. Protecting respondents identities in microdata release. IEEE Transactions on Knowledge and Data Engineering 13, 6 (2001), 1010–1027.Google ScholarDigital Library
- [207] . 2020. ASENVA: Summarizing anatomy model by aggregating sensitive values. In Proceedings of the 2020 International Conference on Electrical Engineering and Informatics (ICELTICs’20). IEEE, Los Alamitos, CA, 1–4.Google Scholar
- [208] . 1998. Estimating the re-identification risk per record in microdata. Journal of Official Statistics 14, 4 (1998), 361.Google Scholar
- [209] . 1994. Disclosure control for census microdata. Journal of Official Statistics–Stockholm 10 (1994), 31.Google Scholar
- [210] . 2002. A measure of disclosure risk for microdata. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64, 4 (2002), 855–867.Google ScholarCross Ref
- [211] . 2014. Enhancing data utility in differential privacy via microaggregation-based k-anonymity. VLDB Journal 23, 5 (2014), 771–794.Google ScholarDigital Library
- [212] . 2014. Enhancing data utility in differential privacy via microaggregation-based k-anonymity. The VLDB Journal 23, 5 (2014), 771–794.Google ScholarDigital Library
- [213] . 1983. The confidentiality and analytic usefulness of masked business microdata. Proceedings of the Section on Survey Research Methods 1983 (1983), 602–607.Google Scholar
- [214] . 2014. \(\mu\)-ARGUS. Retrieved November 1, 2021 from https://github.com/sdcTools/muargus.Google Scholar
- [215] . 1989. The Use of Added Error to Avoid Disclosure in Microdata Releases. Ph. D. Dissertation. Iowa State University.Google ScholarDigital Library
- [216] . 2016. Anatomisation with slicing: A new privacy preservation approach for multiple sensitive attributes. SpringerPlus 5, 1 (2016), 1–21.Google ScholarCross Ref
- [217] . 2000. Simple demographics often identify people uniquely. Health (San Francisco) 671, 2000 (2000), 1–34.Google Scholar
- [218] . 2002. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 5 (2002), 557–570.Google ScholarDigital Library
- [219] . 1999. Local Recoding by Maximum Weight Matching for Disclosure Control of Microdata Sets. CIRJE F-Series CIRJE-F-40, CIRJE, Faculty of Economics, University of Tokyo.Google Scholar
- [220] . 1999. Some superpopulation models for estimating the number of population uniques. In Proceedings of the Conference on Statistical Data Protection. 45–58.Google Scholar
- [221] . 2009. Angel: Enhancing the utility of generalization for privacy preserving publication. IEEE Transactions on Knowledge and Data Engineering 21, 7 (2009), 1073–1087.Google ScholarDigital Library
- [222] . 2015. Statistical disclosure control for micro-data using the R package sdcMicro. Journal of Statistical Software 67, 4 (2015), 1–36.Google ScholarCross Ref
- [223] . 2008. Robust statistics meets SDC: New disclosure risk measures for continuous microdata masking. In Proceedings of the International Conference on Privacy in Statistical Databases. 177–189.Google ScholarDigital Library
- [224] . 1991. Optimal noise addition for preserving confidentiality in multivariate data. Journal of Statistical Planning and Inference 27, 3 (1991), 341–353.Google ScholarCross Ref
- [225] . 2004. Microaggregation for categorical variables: A median based approach. In Proceedings of the International Workshop on Privacy in Statistical Databases. 162–174.Google ScholarCross Ref
- [226] . 2017. Privacy models and disclosure risk measures. In Data Privacy: Foundations, New Developments and the Big Data Challenge. Springer, 111–189.Google Scholar
- [227] . 2022. Guide to Data Privacy: Models, Technologies, Solutions. Springer Nature.Google Scholar
- [228] . 2006. Using Mahalanobis distance-based record linkage for disclosure risk assessment. In Proceedings of the International Conference on Privacy in Statistical Databases. 233–242.Google ScholarDigital Library
- [229] . 2006. Global disclosure risk for microdata with continuous attributes. In Privacy and Technologies of Identity. Springer, 349–363.Google ScholarCross Ref
- [230] . 2006. Privacy protection: P-sensitive k-anonymity property. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06). IEEE, Los Alamitos, CA, 94.Google ScholarDigital Library
- [231] . 2012. UTD Anonymisation ToolBox. http://cs.utdallas.edu/dspl/cgi-bin/toolbox/.
Accessed Nov 2021 .Google Scholar - [232] . 2004. Privacy-preserving outlier detection. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM’04). IEEE, Los Alamitos, CA, 233–240.Google ScholarDigital Library
- [233] . 2018. An evaluation of anonymized models and ensemble classifiers. In Proceedings of the 2018 2nd International Conference on Big Data and Internet of Things. 18–22.Google ScholarDigital Library
- [234] . 2018. Technical privacy metrics: A systematic survey. ACM Computing Surveys 51, 3 (2018), 1–38.Google ScholarDigital Library
- [235] . 2006. Anonymizing sequential releases. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 414–423.Google ScholarDigital Library
- [236] . 2010. Anonymizing temporal data. In Proceedings of the 2010 IEEE International Conference on Data Mining. IEEE, Los Alamitos, CA, 1109–1114.Google ScholarDigital Library
- [237] . 2004. Bottom-up generalization: A data mining solution to privacy protection. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM’04). IEEE, Los Alamitos, CA, 249–256.Google ScholarCross Ref
- [238] . 2008. A new evaluation measure for imbalanced datasets. In Proceedings of the 7th Australasian Data Mining Conference, Vol. 87 27–32.Google ScholarDigital Library
- [239] . 1996. Statistical Disclosure Control in Practice. Vol. 111. Springer Science & Business Media.Google ScholarCross Ref
- [240] . 2000. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 144. Springer.Google Scholar
- [241] . 2003. Protecting data through perturbation techniques: The impact on knowledge discovery in databases. Journal of Database Management 14, 2 (2003), 14–26.Google ScholarCross Ref
- [242] . 2006. (\(\alpha\), k)-Anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 754–759.Google ScholarDigital Library
- [243] . 2006. Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Data Bases. 139–150.Google ScholarDigital Library
- [244] . 2007. M-invariance: towards privacy preserving re-publication of dynamic datasets. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. 689–700.Google ScholarDigital Library
- [245] . 2018. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739 (2018).Google Scholar
- [246] . 2006. Utility-based anonymization using local recoding. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–790.Google ScholarDigital Library
- [247] . 2019. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems 32.Google Scholar
- [248] . 2020. Generation and evaluation of privacy preserving synthetic health data. Neurocomputing 416 (2020), 244–255.Google ScholarCross Ref
- [249] . 2019. YData. Retrieved December 1, 2022 from https://ydata.ai/.Google Scholar
- [250] . 2021. YData Synthetic. Retrieved December 1, 2022 from https://github.com/ydataai/ydata-synthetic.Google Scholar
- [251] . 2017. An anonymization method combining anatomy and permutation for protecting privacy in microdata with multiple sensitive attributes. In Proceedings of the 2017 International Conference on Machine Learning and Cybernetics (ICMLC’17), Vol. 2. IEEE, Los Alamitos, CA, 404–411.Google ScholarCross Ref
- [252] . 2018. Privacy risk in machine learning: Analyzing the connection to overfitting. In Proceedings of the 2018 IEEE 31st Computer Security Foundations Symposium (CSF’18). IEEE, Los Alamitos, CA, 268–282.Google ScholarCross Ref
- [253] . 2007. Aggregate query answering on anonymized tables. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering. 116–125.Google ScholarCross Ref
- [254] . 2021. On the (in) feasibility of attribute inference attacks on machine learning models. In Proceedings of the 2021 IEEE European Symposium on Security and Privacy (EuroS&P’21). IEEE, Los Alamitos, CA, 232–251.Google ScholarCross Ref
- [255] . 2017. Research progress of anonymous data release. In Proceedings of the 2017 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS’17). IEEE, Los Alamitos, CA, 226–230.Google ScholarCross Ref
- [256] . 2020. A survey on privacy properties for data publishing of relational data. IEEE Access 8 (2020), 51071–51099.Google ScholarCross Ref
- [257] . 2020. Privacy preserving classification over differentially private data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. Early access, December 13, 2020.Google Scholar
Index Terms
- Survey on Privacy-Preserving Techniques for Microdata Publication