A Parallel Quasi-identifier Discovery Scheme for Dependable Data Anonymisation

Podlesny, Nikolai J.; Kayem, Anne V. D. M.; Meinel, Christoph

doi:10.1007/978-3-662-64553-6_1

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 12930)

Abstract

Quasi-identifiers (QIDs) are attribute combinations that can be used to discover hidden personally identifying information in an anonymised dataset. Typically, the information drawn from such QIDs can then be combined with other publicly accessible datasets to discover sensitive information (e.g. medical conditions, financial status, or criminal history). Research on data anonymisation has therefore proposed various algorithms to discover and transform quasi-identifiers efficiently to prevent re-identification. However, all existing algorithms are inefficient and fail to prevent re-identification attacks on large, real-world, high-dimensional datasets. This paper presents a quasi-identifier discovery algorithm that combines parallelism with an efficient search technique to find all minimal quasi-identifiers in a given dataset. As a further step, we present an adversary model based on the enumeration problem of discovering unique column combinations in a dataset. We demonstrate that our quasi-identifier discovery algorithm is secure against re-identification attacks under this adversary model, even for large high-dimensional datasets that change dynamically. Our empirical results show that our algorithm not only scales well to large high-dimensional datasets but also exploits its parallelisability on GPU (Graphics Processing Unit) architectures to prevent re-identification even in the presence of a powerful adversary equipped with similar high-performance computing power. Furthermore, our results show that the proposed GPU algorithm offers up to a 100x speedup over the algorithm's CPU version.
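
To make the notion of a minimal quasi-identifier concrete, the following is a minimal sketch in Python. It is not the algorithm presented in the chapter; it is a naive, sequential, level-wise search that assumes the unique-column-combination view mentioned in the abstract, where a column combination counts as identifying when no two rows agree on all of its columns. The function names, the toy table, and its column names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows share the same values on the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_unique_column_combinations(rows, columns):
    """Naive level-wise search (illustrative assumption, not the chapter's method).

    Candidates are tested by increasing size, and any candidate that contains
    an already-found combination is skipped, so every result is minimal.
    """
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if any(set(f) <= set(combo) for f in found):
                continue  # a subset is already unique, so combo is not minimal
            if is_unique(rows, combo):
                found.append(combo)
    return found


# Hypothetical toy table: no single column is unique, but zip + sex is.
rows = [
    {"zip": "14482", "age": 34, "sex": "F", "diagnosis": "flu"},
    {"zip": "14482", "age": 34, "sex": "M", "diagnosis": "asthma"},
    {"zip": "14469", "age": 34, "sex": "F", "diagnosis": "flu"},
    {"zip": "14469", "age": 51, "sex": "M", "diagnosis": "flu"},
]
result = minimal_unique_column_combinations(rows, ["zip", "age", "sex"])
print(result)
```

The number of candidates grows exponentially with the number of columns, which is why a sequential search like this one does not scale to high-dimensional data. The uniqueness checks within one level are independent of each other, which is the kind of structure a parallel implementation could exploit.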

Notes

1. https://github.com/rapidsai/cudf

Author information

Authors and Affiliations

Hasso Plattner Institute at the University of Potsdam, Potsdam, Germany
Nikolai J. Podlesny, Anne V. D. M. Kayem & Christoph Meinel

Corresponding authors

Correspondence to Nikolai J. Podlesny or Anne V. D. M. Kayem.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
IFS, Technical University of Vienna, Vienna, Austria
A Min Tjoa

About this chapter

Cite this chapter

Podlesny, N.J., Kayem, A.V.D.M., Meinel, C. (2021). A Parallel Quasi-identifier Discovery Scheme for Dependable Data Anonymisation. In: Hameurlain, A., Tjoa, A.M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems L. Lecture Notes in Computer Science, vol 12930. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-64553-6_1

DOI: https://doi.org/10.1007/978-3-662-64553-6_1
Published: 01 January 2022
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-64552-9
Online ISBN: 978-3-662-64553-6
eBook Packages: Computer Science, Computer Science (R0)
