A Parallel Quasi-identifier Discovery Scheme for Dependable Data Anonymisation

Transactions on Large-Scale Data- and Knowledge-Centered Systems L

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 12930)

Abstract

Quasi-identifiers (QIDs) are attribute combinations that can be used to recover hidden personally identifying information from an anonymised dataset. Typically, the information drawn from such QIDs can then be combined with other publicly accessible datasets to discover sensitive information (e.g. medical conditions, financial status, or criminal history). Research on data anonymisation has therefore proposed various algorithms to discover and transform quasi-identifiers efficiently to prevent re-identification. However, all existing algorithms are inefficient and fail to prevent re-identification attacks on large, real-world, high-dimensional datasets. This paper presents a quasi-identifier discovery algorithm that combines parallelism with an efficient search technique to find all minimal quasi-identifiers in a given dataset. As a further step, we present an adversary model based on the enumeration problem of discovering unique column combinations in a dataset. We demonstrate that our quasi-identifier discovery algorithm is secure against re-identification attacks under this adversary model, even for large high-dimensional datasets that change dynamically. Our empirical results show that our algorithm not only scales well to large high-dimensional datasets but also exploits its parallelisability on GPU (Graphics Processing Unit) architectures to prevent re-identification even in the presence of a powerful adversary equipped with similar high-performance computing power. Furthermore, our results show that the proposed GPU algorithm offers up to a 100x speedup over the algorithm’s CPU version.
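
To make the notion of a minimal quasi-identifier concrete, the following is a minimal sketch of a sequential, brute-force enumeration of minimal unique column combinations on a toy table. It is not the parallel GPU algorithm presented in the chapter; the function names, column names and sample rows are hypothetical and chosen purely for illustration. Each candidate's uniqueness check is independent of the others at the same level, which is the property a parallel scheme could exploit.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no two rows share the same values on `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_quasi_identifiers(rows, column_names):
    """Enumerate all minimal unique column combinations, smallest first.

    Brute-force, level-wise search (assumption: illustrative only,
    exponential in the number of columns).
    """
    minimal = []
    for size in range(1, len(column_names) + 1):
        for candidate in combinations(column_names, size):
            # A superset of a known result is unique but not minimal: skip it.
            if any(set(found) <= set(candidate) for found in minimal):
                continue
            if is_unique(rows, candidate):
                minimal.append(candidate)
    return minimal


# Hypothetical toy data: no single attribute identifies a row,
# but every pair of attributes does.
rows = [
    {"zip": "10115", "age": 34, "sex": "F"},
    {"zip": "10115", "age": 51, "sex": "M"},
    {"zip": "10117", "age": 34, "sex": "M"},
    {"zip": "10117", "age": 51, "sex": "F"},
]
result = minimal_quasi_identifiers(rows, ["zip", "age", "sex"])
print(result)
```

On this toy table, the three two-column combinations are reported, and the full three-column combination is skipped because it is a superset of an already discovered result.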

Notes

  1. https://github.com/rapidsai/cudf

Author information

Correspondence to Nikolai J. Podlesny or Anne V. D. M. Kayem.

Copyright information

© 2021 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Podlesny, N.J., Kayem, A.V.D.M., Meinel, C. (2021). A Parallel Quasi-identifier Discovery Scheme for Dependable Data Anonymisation. In: Hameurlain, A., Tjoa, A.M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems L. Lecture Notes in Computer Science, vol 12930. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-64553-6_1

  • DOI: https://doi.org/10.1007/978-3-662-64553-6_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-64552-9

  • Online ISBN: 978-3-662-64553-6

  • eBook Packages: Computer Science, Computer Science (R0)
