
A sanitization approach for big data with improved data utility

Published in: Applied Intelligence

Abstract

Collaborative data mining may expose sensitive patterns present in the data, which can be undesirable to the data owner. Sensitive Pattern Hiding (SPH) is a subfield of data mining that addresses this problem. However, most existing approaches for hiding sensitive patterns cause high side-effects on non-sensitive patterns, which in turn reduces the utility of the sanitized dataset. Furthermore, most of them are sequential in nature, cannot cope with massive amounts of data, and often incur high execution times. To resolve these challenges of low utility and infeasibility, two parallelized approaches, named PGVIR and PHCR, are proposed based on the Spark parallel computing framework; they modify the data such that no sensitive pattern can be extracted while the utility of the sanitized dataset is maintained. Experiments performed on benchmark datasets show that PGVIR scales better and PHCR causes fewer side-effects to the data compared to existing techniques.
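To make the sanitization problem concrete, the following is a minimal sketch of support-based pattern hiding in general, not the paper's PGVIR or PHCR algorithms: an item is deleted from just enough supporting transactions that a sensitive itemset's support falls below the minimum support threshold (MST). The function names and the naive victim-item choice are ours, for illustration only.

```python
# Minimal illustration of support-based sanitization (NOT the paper's
# PGVIR/PHCR algorithms): drop a sensitive itemset's support below MST
# by deleting one of its items from supporting transactions.

def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def sanitize(transactions, sensitive, mst):
    """Remove a (naively chosen) victim item from supporting transactions
    until support(sensitive) falls below `mst`."""
    txs = [set(t) for t in transactions]
    victim = next(iter(sensitive))          # naive victim choice
    for t in txs:
        if support(txs, sensitive) < mst:
            break                           # pattern is already hidden
        if sensitive <= t:
            t.discard(victim)               # break the pattern in this row
    return txs

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
clean = sanitize(data, {"a", "b"}, mst=0.5)
print(support(clean, {"a", "b"}))  # 0.25 (was 0.75 before sanitization)
```

Real SPH schemes differ mainly in how they pick the victim item and the transactions to modify, since those choices determine the side-effects on non-sensitive patterns that the paper aims to minimize.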



References

  1. Amiri A (2007) Dare to share: protecting sensitive knowledge with data sanitization. Decis Support Syst 43 (1):181–191


  2. Atallah M, Bertino E, Elmagarmid A, Ibrahim M, Verykios V (1999) Disclosure limitation of sensitive rules. In: Proceedings 1999 workshop on knowledge and data engineering exchange (KDEX'99). IEEE, pp 45–52

  3. Dasseni E, Verykios VS, Elmagarmid AK, Bertino E (2001) Hiding association rules by using confidence and support. In: International workshop on information hiding. Springer, pp 369–383

  4. Geurts K, Wets G, Brijs T, Vanhoof K (2003) Profiling of high-frequency accident locations by use of association rules. Transp Res Record: J Transp Res Board 1840:123–130


  5. Gkoulalas-Divanis A, Verykios VS (2006) An integer programming approach for frequent itemset hiding. In: Proceedings of the 15th ACM international conference on information and knowledge management. ACM, pp 748–757

  6. Lee G, Chang C-Y, Chen ALP (2004) Hiding sensitive patterns in association rules mining. In: Proceedings of the 28th annual international computer software and applications conference (COMPSAC 2004). IEEE, pp 424–429

  7. Menon S, Sarkar S, Mukherjee S (2005) Maximizing accuracy of shared databases when concealing sensitive patterns, vol 16

  8. Moustakides GV, Verykios VS (2008) A maxmin approach for hiding frequent itemsets. Data Knowl Eng 65(1):75–89


  9. Apache Spark documentation. https://spark.apache.org/docs/2.3.0/

  10. Oliveira SRM, Zaiane OR (2002) Privacy preserving frequent itemset mining. In: Proceedings of the IEEE international conference on privacy, security and data mining, vol 14. Australian Computer Society Inc., pp 43–54

  11. Oliveira SRM, Zaiane OR (2003) Protecting sensitive knowledge by data sanitization. In: Third IEEE international conference on data mining (ICDM 2003). IEEE, pp 613–616

  12. Sharma S, Toshniwal D (2018) MR-I MaxMin-scalable two-phase border based knowledge hiding technique using MapReduce. Future Generation Computer Systems

  13. Liu F, Shu X, Yao D, Butt AR (2015) Privacy-preserving scanning of big content for sensitive data exposure with MapReduce. In: Proceedings of the 5th ACM conference on data and application security and privacy. ACM, New York, pp 195–206

  14. Sun X, Yu PS (2005) A border-based approach for hiding sensitive frequent itemsets. In: Fifth IEEE international conference on data mining (ICDM). IEEE, pp 8–

  15. Shivani S, Toshniwal D (2017) Scalable two-phase co-occurring sensitive pattern hiding using MapReduce. J Big Data 4(1):4


  16. Sharma S, Toshniwal D (2015) Parallelization of association rule mining: survey. In: 2015 International conference on computing, communication and security (ICCCS), Pamplemousses, pp 1–6

  17. Zhang X, et al. (2014) A scalable two-phase top-down specialization approach for data anonymization using mapreduce on cloud. IEEE Trans Parallel Distrib Syst 25.2:363–373


  18. Zhang Y, Cao T, Li S, Tian X, Yuan L, Jia H, Vasilakos AV (2016) Parallel processing systems for big data: a survey. Proc IEEE 104(11):2114–2136


  19. Fung BC, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: 21st international conference on data engineering (ICDE’05). IEEE, New York

  20. Sharma S, Toshniwal D (2018) MR-I MaxMin-scalable two-phase border based knowledge hiding technique using MapReduce. Future Generation Computer Systems

  21. Han Z, Zhang Y (2015) Spark: a big data processing platform based on memory computing. In: 2015 Seventh international symposium on parallel architectures, algorithms and programming (PAAP), Nanjing, pp 172–176

  22. Liu F, Shu X, Yao D, Butt AR (2015) Privacy-preserving scanning of big content for sensitive data exposure with MapReduce. In: Proceedings of the 5th ACM conference on data and application security and privacy, pp 195–206

  23. Hong TP, Lin CW, Yang KT, Wang SL (2011) A heuristic data-sanitization approach based on TF-IDF. In: International conference on industrial engineering and other applications of applied intelligent systems, pp 156–164

  24. Cheng P, Roddick JF, Chu SC, Lin CW (2016) Privacy preservation through a Greedy, distortion-based rule-hiding method. Appl Intell 44(2):295–306


  25. Telikani A, Shahbahrami A, Tavoli R (2015) Data sanitization in association rule mining based on impact factor. J AI Data Min 3(2):131–140


  26. https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/


Author information


Correspondence to Shivani Sharma.


Appendix: Comparison with existing parallel SPH techniques

Another set of experiments has been performed to judge the performance of PGVIR and PHCR against the parallel versions of the MaxFIA and SWA schemes proposed in [10, 11]. The first experiment varies the data size. Figure 12a plots the execution time of the sanitization process for varying data sizes, with the MST set to 20% and 50 sensitive patterns to be masked. It can be clearly observed that, owing to the two-way parallelization achieved in PGVIR and PHCR, i.e. data parallelization and computation parallelization, the proposed schemes outperform the existing state of the art. Furthermore, the proposed schemes are implemented on the Spark platform, whereas parallel MaxFIA and SWA are implemented using Hadoop MapReduce, which further explains the clear difference: Spark [26] is faster than Hadoop MapReduce for several reasons, such as in-memory computation and data-frame creation, and the initialization time of Hadoop is much higher than that of Spark.
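The data-parallelization half of this two-way scheme can be sketched as a map/reduce over partitions of the transaction database. The snippet below is illustrative only, not PGVIR or PHCR code: each partition counts local supports independently (the map side, which Spark would run on separate executors), and the per-partition counts are then merged (the reduce side). The partitioning and function names are ours.

```python
# Illustrative data-parallel support counting (NOT the paper's code):
# per-partition "map" counts merged by a "reduce". On Spark, the map
# calls would run concurrently on executors; here they run sequentially.

from collections import Counter
from itertools import combinations

def local_counts(partition):
    """Map step: count 2-itemset occurrences within one partition."""
    c = Counter()
    for t in partition:
        for pair in combinations(sorted(t), 2):
            c[pair] += 1
    return c

def global_counts(partitions):
    """Reduce step: merge per-partition counters into global counts."""
    total = Counter()
    for c in map(local_counts, partitions):  # concurrent on Spark executors
        total.update(c)
    return total

partitions = [
    [{"a", "b", "c"}, {"a", "b"}],   # partition on executor 1
    [{"a", "b", "d"}, {"b", "c"}],   # partition on executor 2
]
print(global_counts(partitions)[("a", "b")])  # 3 occurrences in total
```

Because local counters are additive, no partition ever needs another partition's transactions, which is what lets the sanitization workload scale with data size as shown in Fig. 12a.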

A second set of experiments has been performed to analyze running time for varying minimum support threshold values. Figure 12b plots running time against MST. It can be observed that, across different minimum support threshold values, the execution time of PGVIR and PHCR is considerably lower than that of parallel MaxFIA and SWA. Therefore, the two-way parallelization and the use of the Spark platform make the proposed PGVIR and PHCR a better choice for preserving the privacy of sensitive data.
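Alongside running time, comparisons of hiding schemes typically measure side-effects: non-sensitive patterns that were frequent before sanitization but are lost afterwards. The sketch below shows one such metric over 2-itemsets; it is our illustration of the general criterion, not a metric definition taken from this paper.

```python
# Illustrative side-effect metric for comparing hiding schemes: the set of
# non-sensitive 2-itemsets that were frequent before sanitization but are
# no longer frequent afterwards (fewer lost patterns = higher utility).

from itertools import combinations

def frequent_pairs(transactions, mst):
    """All 2-itemsets whose support meets the minimum support threshold."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for p in combinations(sorted(t), 2):
            counts[p] = counts.get(p, 0) + 1
    return {p for p, c in counts.items() if c / n >= mst}

def side_effects(before, after, sensitive, mst):
    """Non-sensitive patterns lost by sanitization at the given MST."""
    lost = frequent_pairs(before, mst) - frequent_pairs(after, mst)
    return lost - sensitive          # hiding the sensitive ones is intended

before = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
after  = [{"a", "c"}, {"a"}, {"a", "d"}, {"c", "d"}]
print(side_effects(before, after, {("a", "b")}, mst=0.5))  # set() = no loss
```

A scheme like PHCR that reports fewer side-effects would, under this kind of metric, leave more of the non-sensitive frequent patterns intact at each tested MST.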

Fig. 12 Running time of parallel MaxFIA and SWA vs. proposed PGVIR and PHCR



Cite this article

Sharma, U., Toshniwal, D. & Sharma, S. A sanitization approach for big data with improved data utility. Appl Intell 50, 2025–2039 (2020). https://doi.org/10.1007/s10489-020-01640-4

