
A sanitization approach for big data with improved data utility

Published in: Applied Intelligence

Abstract

Collaborative data mining may expose sensitive patterns present in the data, which can be undesirable to the data owner. Sensitive Pattern Hiding (SPH) is a subfield of data mining that addresses this problem. However, most existing approaches for hiding sensitive patterns cause high side-effects on non-sensitive patterns, which in turn reduces the utility of the sanitized dataset. Furthermore, most of them are sequential in nature, cannot cope with massive amounts of data, and often incur high execution times. To resolve these challenges of low utility and infeasibility, two parallelized approaches, named PGVIR and PHCR, are proposed based on the Spark parallel computing framework; they modify the data such that no sensitive pattern can be extracted while the utility of the sanitized dataset is maintained. Experiments performed on benchmark datasets show that PGVIR scales better and PHCR causes fewer side-effects to the data compared to existing techniques.
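To make the sanitization problem concrete, the following is a minimal sketch of support-based pattern hiding in general, not the paper's PGVIR or PHCR algorithms: an item is deleted from just enough supporting transactions that a sensitive itemset's support falls below the minimum support threshold (MST). The function names and the naive victim-item choice are ours, for illustration only.

```python
# Minimal illustration of support-based sanitization (NOT the paper's
# PGVIR/PHCR algorithms): drop a sensitive itemset's support below MST
# by deleting one of its items from supporting transactions.

def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def sanitize(transactions, sensitive, mst):
    """Remove a (naively chosen) victim item from supporting transactions
    until support(sensitive) falls below `mst`."""
    txs = [set(t) for t in transactions]
    victim = next(iter(sensitive))          # naive victim choice
    for t in txs:
        if support(txs, sensitive) < mst:
            break                           # pattern is already hidden
        if sensitive <= t:
            t.discard(victim)               # break the pattern in this row
    return txs

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
clean = sanitize(data, {"a", "b"}, mst=0.5)
print(support(clean, {"a", "b"}))  # 0.25 (was 0.75 before sanitization)
```

Real SPH schemes differ mainly in how they pick the victim item and the transactions to modify, since those choices determine the side-effects on non-sensitive patterns that the paper aims to minimize.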



References

  1. Amiri A (2007) Dare to share: protecting sensitive knowledge with data sanitization. Decis Support Syst 43 (1):181–191


  2. Atallah M, Bertino E, Elmagarmid A, Ibrahim M, Verykios V (1999) Disclosure limitation of sensitive rules. In: Proceedings 1999 workshop on knowledge and data engineering exchange (KDEX'99). IEEE, pp 45–52

  3. Dasseni E, Verykios VS, Elmagarmid AK, Bertino E (2001) Hiding association rules by using confidence and support. In: International workshop on information hiding. Springer, pp 369–383

  4. Geurts K, Wets G, Brijs T, Vanhoof K (2003) Profiling of high-frequency accident locations by use of association rules. Transp Res Record: J Transp Res Board 1840:123–130


  5. Gkoulalas-Divanis A, Verykios VS (2006) An integer programming approach for frequent itemset hiding. In: Proceedings of the 15th ACM international conference on information and knowledge management. ACM, pp 748–757

  6. Lee G, Chang C-Y, Chen ALP (2004) Hiding sensitive patterns in association rules mining. In: Proceedings of the 28th annual international computer software and applications conference (COMPSAC 2004). IEEE, pp 424–429

  7. Menon S, Sarkar S, Mukherjee S (2005) Maximizing accuracy of shared databases when concealing sensitive patterns, vol 16

  8. Moustakides GV, Verykios VS (2008) A maxmin approach for hiding frequent itemsets. Data Knowl Eng 65(1):75–89


  9. Apache Spark documentation. https://spark.apache.org/docs/2.3.0/

  10. Oliveira SRM, Zaiane OR (2002) Privacy preserving frequent itemset mining. In: Proceedings of the IEEE international conference on privacy, security and data mining, vol 14. Australian Computer Society Inc., pp 43–54

  11. Oliveira SRM, Zaiane OR (2003) Protecting sensitive knowledge by data sanitization. In: Third IEEE international conference on data mining (ICDM 2003). IEEE, pp 613–616

  12. Sharma S, Toshniwal D (2018) MR-I MaxMin-scalable two-phase border based knowledge hiding technique using MapReduce. Future Generation Computer Systems

  13. Liu F, Shu X, Yao D, Butt AR (2015) Privacy-preserving scanning of big content for sensitive data exposure with MapReduce. In: Proceedings of the 5th ACM conference on data and application security and privacy. ACM, New York, pp 195–206

  14. Sun X, Yu PS (2005) A border-based approach for hiding sensitive frequent itemsets. In: Fifth IEEE international conference on data mining (ICDM). IEEE, pp 8–

  15. Shivani S, Toshniwal D (2017) Scalable two-phase co-occurring sensitive pattern hiding using MapReduce. J Big Data 4(1):4


  16. Sharma S, Toshniwal D (2015) Parallelization of association rule mining: survey. In: 2015 International conference on computing, communication and security (ICCCS), Pamplemousses, pp 1–6

  17. Zhang X, et al. (2014) A scalable two-phase top-down specialization approach for data anonymization using mapreduce on cloud. IEEE Trans Parallel Distrib Syst 25.2:363–373


  18. Zhang Y, Cao T, Li S, Tian X, Yuan L, Jia H, Vasilakos AV (2016) Parallel processing systems for big data: a survey. Proc IEEE 104(11):2114–2136


  19. Fung BC, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: 21st international conference on data engineering (ICDE’05). IEEE, New York

  20. Sharma S, Toshniwal D (2018) MR-I MaxMin-scalable two-phase border based knowledge hiding technique using MapReduce. Future Generation Computer Systems

  21. Han Z, Zhang Y (2015) Spark: a big data processing platform based on memory computing. In: 2015 Seventh international symposium on parallel architectures, algorithms and programming (PAAP), Nanjing, pp 172–176

  22. Liu F, Shu X, Yao D, Butt AR (2015) Privacy-preserving scanning of big content for sensitive data exposure with MapReduce. In: Proceedings of the 5th ACM conference on data and application security and privacy, pp 195–206

  23. Hong TP, Lin CW, Yang KT, Wang SL (2011) A heuristic data-sanitization approach based on TF-IDF. In: International conference on industrial engineering and other applications of applied intelligent systems, pp 156–164

  24. Cheng P, Roddick JF, Chu SC, Lin CW (2016) Privacy preservation through a Greedy, distortion-based rule-hiding method. Appl Intell 44(2):295–306


  25. Telikani A, Shahbahrami A, Tavoli R (2015) Data sanitization in association rule mining based on impact factor. J AI Data Min 3(2):131–140


  26. https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/


Author information


Correspondence to Shivani Sharma.


Appendix: Comparison with existing parallel SPH techniques

Another set of experiments has been performed to judge the performance of PGVIR and PHCR against the parallel versions of the MaxFIA and SWA schemes proposed in [10, 11]. The first experiment varies the data size. Figure 12a plots the execution time of the sanitization process for varying data sizes, with the MST set to 20% and 50 sensitive patterns to be masked. It can be clearly observed that, owing to the two-way parallelization achieved in PGVIR and PHCR, i.e. data parallelization and computation parallelization, the proposed schemes outperform the existing state of the art. Furthermore, the proposed schemes are implemented on the Spark platform, whereas parallel MaxFIA and SWA are implemented using Hadoop MapReduce, which further explains the clear difference: Spark [26] is faster than Hadoop MapReduce for several reasons, such as in-memory computation and data-frame creation, and the initialization time of Hadoop is much higher than that of Spark.
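The data-parallelization half of this two-way scheme can be sketched as a map/reduce over partitions of the transaction database. The snippet below is illustrative only, not PGVIR or PHCR code: each partition counts local supports independently (the map side, which Spark would run on separate executors), and the per-partition counts are then merged (the reduce side). The partitioning and function names are ours.

```python
# Illustrative data-parallel support counting (NOT the paper's code):
# per-partition "map" counts merged by a "reduce". On Spark, the map
# calls would run concurrently on executors; here they run sequentially.

from collections import Counter
from itertools import combinations

def local_counts(partition):
    """Map step: count 2-itemset occurrences within one partition."""
    c = Counter()
    for t in partition:
        for pair in combinations(sorted(t), 2):
            c[pair] += 1
    return c

def global_counts(partitions):
    """Reduce step: merge per-partition counters into global counts."""
    total = Counter()
    for c in map(local_counts, partitions):  # concurrent on Spark executors
        total.update(c)
    return total

partitions = [
    [{"a", "b", "c"}, {"a", "b"}],   # partition on executor 1
    [{"a", "b", "d"}, {"b", "c"}],   # partition on executor 2
]
print(global_counts(partitions)[("a", "b")])  # 3 occurrences in total
```

Because local counters are additive, no partition ever needs another partition's transactions, which is what lets the sanitization workload scale with data size as shown in Fig. 12a.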

A second set of experiments has been performed to analyze running time for varying minimum support threshold values. Figure 12b plots running time against MST. It can be observed that, across different minimum support threshold values, the execution time of PGVIR and PHCR is considerably lower than that of parallel MaxFIA and SWA. Therefore, the two-way parallelization and the use of the Spark platform make the proposed PGVIR and PHCR a better choice for preserving the privacy of sensitive data.
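Alongside running time, comparisons of hiding schemes typically measure side-effects: non-sensitive patterns that were frequent before sanitization but are lost afterwards. The sketch below shows one such metric over 2-itemsets; it is our illustration of the general criterion, not a metric definition taken from this paper.

```python
# Illustrative side-effect metric for comparing hiding schemes: the set of
# non-sensitive 2-itemsets that were frequent before sanitization but are
# no longer frequent afterwards (fewer lost patterns = higher utility).

from itertools import combinations

def frequent_pairs(transactions, mst):
    """All 2-itemsets whose support meets the minimum support threshold."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for p in combinations(sorted(t), 2):
            counts[p] = counts.get(p, 0) + 1
    return {p for p, c in counts.items() if c / n >= mst}

def side_effects(before, after, sensitive, mst):
    """Non-sensitive patterns lost by sanitization at the given MST."""
    lost = frequent_pairs(before, mst) - frequent_pairs(after, mst)
    return lost - sensitive          # hiding the sensitive ones is intended

before = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
after  = [{"a", "c"}, {"a"}, {"a", "d"}, {"c", "d"}]
print(side_effects(before, after, {("a", "b")}, mst=0.5))  # set() = no loss
```

A scheme like PHCR that reports fewer side-effects would, under this kind of metric, leave more of the non-sensitive frequent patterns intact at each tested MST.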

Fig. 12 Running time of parallel MaxFIA and SWA vs. proposed PGVIR and PHCR



Cite this article

Sharma, U., Toshniwal, D. & Sharma, S. A sanitization approach for big data with improved data utility. Appl Intell 50, 2025–2039 (2020). https://doi.org/10.1007/s10489-020-01640-4

