GPU Accelerated Bayesian Inference for Quasi-Identifier Discovery in High-Dimensional Data

  • Conference paper

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 226)

Abstract

Determining unique attribute combinations as quasi-identifiers is a common starting point for both re-identification attacks and data anonymisation schemes. The efficient discovery of these quasi-identifiers (QIDs) is combinatorially hard: it is an enumeration problem [1,2,3] and is W[2]-complete [4,5,6]. Proper privacy guarantees are required to meet high ethical standards and privacy legislation such as the CCPA or GDPR, yet they also enable modern data-driven business models based on monetising corporate data pools. In this work, we offer three main contributions. First, we contribute an algorithm that vectorises the QID search. This QID discovery is based on Bayesian inference, which usually suffers a state-space explosion on large-scale datasets. By using GPU acceleration to execute the vectorised algorithm, we counter the state-space explosion raised by Bayesian networks. Second, we show its applicability to anonymising high-dimensional data, which suffers high information loss under standard anonymisation approaches. Third, we offer an empirical model that compares multiple optimisations for discovering all QIDs in near real-time, even in large-scale datasets. This is particularly useful in digital health settings, where algorithmic execution time can influence life-and-death triage. Finally, we point out that the same approach can foster de-anonymisation attacks on already published datasets. We include a demonstration that re-identifies individuals from Mount Vernon, NY and Southern California in a published Twitter dataset on the 2020 US Presidential Election.
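
To make the search problem concrete, the following minimal sketch enumerates attribute combinations on a tiny invented table and keeps the minimal ones whose values are distinct for every row. It is a brute-force, CPU-only illustration of the enumeration that the abstract calls combinatorially hard, not the vectorised GPU algorithm or the Bayesian inference step of the paper. Treating a QID as a combination that is unique across all rows is a simplifying assumption, and the function name `find_minimal_qids`, the column names and the rows are hypothetical.

```python
from itertools import combinations


def find_minimal_qids(rows, columns, max_size=None):
    """Return minimal column combinations whose projected values are unique per row.

    Assumption: a QID is a combination of columns that distinguishes every row.
    The search is exhaustive, so the number of candidates grows exponentially
    with the number of columns.
    """
    max_size = max_size or len(columns)
    minimal = []  # index tuples of the minimal QIDs found so far
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(columns)), size):
            # A superset of a known QID is also unique, but it is not minimal.
            if any(set(found) <= set(combo) for found in minimal):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            if len(set(projected)) == len(rows):
                minimal.append(combo)
    return [tuple(columns[i] for i in combo) for combo in minimal]


# Invented toy data: no single column is unique, but (zip, sex) is.
columns = ["zip", "age", "sex"]
rows = [
    ("10550", 34, "F"),
    ("10550", 34, "M"),
    ("90001", 34, "F"),
    ("90001", 51, "M"),
]

qids = find_minimal_qids(rows, columns)
print(qids)
```

Even on this toy table the loop visits every subset of columns that is not pruned, which is why larger, high-dimensional datasets motivate the vectorised and GPU-accelerated search described in the abstract.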


Notes

  1. https://github.com/rapidsai/cudf

  2. https://rapids.ai/

  3. https://pandas.pydata.org/

  4. https://developer.nvidia.com/cuda-zone

  5. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html

  6. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html

References

  1. Nickolls, J., Dally, W.J.: The GPU computing era. IEEE Micro 30(2), 56–69 (2010)

  2. Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C.: GPU computing. Proc. IEEE 96(5), 879–899 (2008)

  3. Cook, C., Zhao, H., Sato, T., Hiromoto, M., Tan, S.X.D.: GPU-based Ising computing for solving max-cut combinatorial optimization problems. Integration 69, 335–344 (2019)

  4. Podlesny, N.J., Kayem, A.V., Meinel, C.: Attribute compartmentation and greedy UCC discovery for high-dimensional data anonymization. In: Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy, pp. 109–119. ACM (2019)

  5. Bläsius, T., Friedrich, T., Lischeid, J., Meeks, K., Schirneck, M.: Efficiently enumerating hitting sets of hypergraphs arising in data profiling. In: Algorithm Engineering and Experiments (ALENEX), pp. 130–143 (2019)

  6. Bläsius, T., Friedrich, T., Schirneck, M.: The parameterized complexity of dependency detection in relational databases. In: Guo, J., Hermelin, D. (eds.) International Symposium on Parameterized and Exact Computation (IPEC), Leibniz International Proceedings in Informatics (LIPIcs), vol. 63, pp. 6:1–6:13. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2016)

  7. Barth-Jones, D.: The 're-identification' of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now (2012)

  8. Price, W.N., Cohen, I.G.: Privacy in the age of medical big data. Nature Med. 25(1), 37–43 (2019)

  9. Zhu, L., Jin, H., Zheng, R., Feng, X.: Effective Naive Bayes nearest neighbor based image classification on GPU. J. Supercomput. 68(2), 820–848 (2014)

  10. Viegas, F., Gonçalves, M.A., Martins, W., Rocha, L.: Parallel lazy semi-Naive Bayes strategies for effective and efficient document classification. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1071–1080 (2015)

  11. Andrade, G., Viegas, F., Ramos, G.S., Almeida, J., Rocha, L., Gonçalves, M., Ferreira, R.: GPU-NB: a fast CUDA-based implementation of Naive Bayes. In: 2013 25th International Symposium on Computer Architecture and High Performance Computing, pp. 168–175. IEEE (2013)

  12. Chen, F.C., Jahanshahi, M.R.: NB-CNN: deep learning-based crack detection using convolutional neural network and Naïve Bayes data fusion. IEEE Trans. Ind. Electron. 65(5), 4392–4400 (2017)

  13. Gruber, L., et al.: GPU-accelerated Bayesian learning and forecasting in simultaneous graphical dynamic linear models. Bayesian Anal. 11(1), 125–149 (2016)

  14. Ng, W.S., Kirchberg, M., Bressan, S., Tan, K.L.: Towards a privacy-aware stream data management system for cloud applications. Int. J. Web Grid Serv. 7(3), 246–267 (2011)

  15. Kalidoss, T., Sannasi, G., Lakshmanan, S., Kanagasabai, K., Kannan, A.: Data anonymisation of vertically partitioned data using map reduce techniques on cloud. Int. J. Commun. Netw. Distrib. Syst. 20(4), 519–531 (2018)

  16. Solanki, P., Garg, S., Chhinkaniwala, H.: Heuristic-based hybrid privacy-preserving data stream mining approach using SD-perturbation and multi-iterative k-anonymisation. Int. J. Knowl. Eng. Data Min. 5(4), 306–332 (2018)

  17. Podlesny, N.J., Kayem, A.V., Meinel, C.: Towards identifying de-anonymisation risks in distributed health data silos. In: International Conference on Database and Expert Systems Applications, pp. 33–43. Springer (2019)

  18. Podlesny, N.J., Kayem, A.V., Meinel, C.: Identifying data exposure across high-dimensional health data silos through Bayesian networks optimised by multigrid and manifold. In: IEEE 17th International Conference on Dependable, Autonomic and Secure Computing, DASC 2019. IEEE (2019)

  19. Nayahi, J.J.V., Kavitha, V.: Privacy and utility preserving data clustering for data anonymization and distribution on hadoop. Future Gener. Comput. Syst. 74, 393–408 (2017)

  20. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters (2004)

  21. Podlesny, N.J.: Synthetic genome data (2021)

  22. Sabuncu, I.: USA Nov.2020 election 20 mil. tweets (with sentiment and party name labels) dataset (2020)

Author information

Corresponding author

Correspondence to Nikolai J. Podlesny.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Podlesny, N.J., Kayem, A.V.D.M., Meinel, C. (2021). GPU Accelerated Bayesian Inference for Quasi-Identifier Discovery in High-Dimensional Data. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 226. Springer, Cham. https://doi.org/10.1007/978-3-030-75075-6_40
