Hiding sensitive information in eHealth datasets

https://doi.org/10.1016/j.future.2020.11.026

Highlights

  • This paper investigates two GA-based models for hiding sensitive information.

  • A minimal support threshold function is proposed to identify different support thresholds for varied lengths of patterns.

  • A “pre-large” concept is applied within the designed GA-based model.

  • Experiments show that the designed model outperforms the generic EC-based models.

Abstract

Privacy-preserving data mining (PPDM) has become a hot topic in both academic research and industry because it can discover implicit rules while hiding sensitive information through data sanitization. Many algorithms and heuristics based on evolutionary computation have been investigated to hide sensitive information by transaction deletion, but to date these algorithms consider only a uniform threshold value during the sanitization process. This approach is not applicable in real-world situations, especially for eHealth-based medical datasets. For example, a patient can still be identified if he/she has more confidential information (i.e., symptoms), which causes privacy threats and security leakage in medical applications. In this work, we investigate a novel methodology that sets varied threshold values for sensitive patterns of varied lengths within a Genetic Algorithm (GA)-based framework. As the pattern length increases, a tighter threshold is applied to better protect sensitive information, which prevents individual patients from being identified in eHealth datasets. Two GA-based models are developed for data sanitization using record deletion. Experiments were conducted to compare the designed methods with traditional Evolutionary Computation (EC)-based PPDM approaches, and the results show that the designed methods offer greater protection than previous methods in terms of side effects. Therefore, the designed models are effective at hiding sensitive information in medical settings and can be used in real-world scenarios.

Introduction

With the rapid growth of information technologies used in the Internet of Things (IoT) [1], such as parallel computing [2], edge computing [3], and machine learning [4], it is necessary to use those techniques to retrieve useful information for decision-making. In the past decade, several data mining techniques [5], [6], [7], [8], [9] have been applied in different domains and applications to retrieve meaningful information with intrinsic value from massive datasets. The fundamental algorithm for mining the required patterns is Apriori [10], which first uses a minimum support threshold to discover the set of frequent itemsets (FIs) from the database with a “level-wise” technique. Next, a combinational approach derives association rules (ARs) from the FIs based on a minimum confidence threshold. Because the “level-wise” approach applies the well-known “generate-and-test” mechanism, which has a huge computational cost, the efficient frequent pattern (FP)-tree structure [11] was developed to keep frequent 1-itemsets in a tree, and a recursive FP-growth mining algorithm was developed to mine the set of FIs from it. Several extensions of knowledge discovery in databases (KDD) were then implemented to handle different scenarios and domains and to retrieve various kinds of knowledge for decision-making, e.g., sequential pattern mining [12] and high-utility itemset mining [13].
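
The level-wise "generate-and-test" idea described above can be illustrated with a minimal Python sketch. It is a simplified textbook-style version written for this summary, not the implementation from [10]; the function name `apriori` and the convention that `min_support` is a fraction of the database size are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Assumption: `min_support` is a fraction of the number of transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}
    level = {c for c in level if support(c) >= min_support}

    frequent = {}
    k = 1
    while level:
        for c in level:
            frequent[c] = support(c)
        k += 1
        # Generate: join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Test: one pass over the database per level.
        level = {c for c in candidates if support(c) >= min_support}
    return frequent
```

Each iteration of the loop rescans the whole database to test the candidates of one length, which is the cost that the FP-tree approach is designed to avoid.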

Although KDD techniques can be used to mine the relationships among attributes in a database, confidential/private information can also be revealed or inferred from related information during the mining process [14]. For example, purchase behaviors can reveal the malls visited and even the gender of customers, which should be considered confidential information in data analytics. One technique is a perturbation or sanitization approach in which the confidential information in a patient’s record is perturbed using a random process. This process distorts sensitive data values by adding to, subtracting from, or otherwise perturbing the data. Lindell and Pinkas investigated the ID3 algorithm [15] based on the decision tree for PPDM. Clifton et al. developed software that can be utilized for solving the PPDM problem [16]. Dwork et al. [17] designed several models that handle published noisy statistics on top of vertically partitioned datasets. Wu et al. [18] also created several algorithms that hide sensitive information (SI) by decreasing the support and confidence values. Besides, Hong et al. [19] considered the TF-IDF model and developed the SIF-IDF algorithm, which calculates a score for every transaction for use in data sanitization. Moreover, there are many challenges in revealing useful knowledge in the eHealth field due to strict privacy requirements [20], [21], [22], [23], [24]. In addition, when the sanitization process hides SIs, legitimate rules are commonly lost and artificial rules may appear as side effects of the sanitization process. Three well-known side effects in any PPDM process are (1) hiding failure, (2) missing cost, and (3) artificial cost, which can all be considered evaluation criteria for data sanitization.
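
As a rough illustration of the three side effects, the following Python sketch computes them from the sets of frequent itemsets mined before and after sanitization. The set-based definitions used here are the ones commonly found in the PPDM literature and are an assumption, since this section does not spell them out; the function name `side_effects` is hypothetical.

```python
def side_effects(frequent_before, frequent_after, sensitive):
    """Return (hiding_failure, missing_cost, artificial_cost) as sets of itemsets.

    Assumed definitions (not stated in this section):
      hiding failure  : sensitive itemsets that are still frequent afterwards
      missing cost    : non-sensitive itemsets that were frequent but no longer are
      artificial cost : itemsets that became frequent only after sanitization
    """
    before = set(frequent_before)
    after = set(frequent_after)
    sensitive = set(sensitive)

    hiding_failure = sensitive & after
    missing_cost = (before - sensitive) - after
    artificial_cost = after - before
    return hiding_failure, missing_cost, artificial_cost
```

With record deletion, artificial itemsets can appear because the database shrinks, so a pattern whose absolute count is unchanged may end up with a higher relative support.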
However, minimizing the three side effects can be considered an NP-hard optimization problem [25], [26], since the appropriate data must be selected for sanitization with minimal side effects. Evolutionary computation (EC) is an alternative way to find optimized solutions and has been applied to many NP-hard problems. Lin et al. [27] were the first to apply the GA to PPDM and developed a GA-based model for data sanitization. Lin et al. then considered the particle swarm optimization (PSO) model and presented three PSO-based approaches [28], [29], [30] to improve the effectiveness and efficiency of the GA-based approach for data sanitization. Many other novel PPDM algorithms and frameworks have also been developed recently [31], [32]. Although those models are more effective than generic approaches for data sanitization, they still rely on a uniform threshold for sanitization, which is not applicable in many domains and applications. The problem is that a longer confidential pattern can still lead to disclosure under a uniform threshold value. For example, in medical datasets, a patient can be identified if he/she has many symptoms of a given disease, which poses a privacy threat and makes it harder to protect the patient’s information. If a single loose threshold is set, many private long patterns will be identified. On the other hand, if a single strict threshold is set, the sanitized database inevitably suffers serious side effects. In this paper, we aim to provide a secure privacy preservation system that can be used for data sanitization of medical datasets. The major contributions of our work are listed as follows:

  1. This paper investigates two GA-based models for hiding sensitive information based on varied threshold values of sensitive patterns, which is more applicable in real-world situations, especially in eHealth-based medical datasets.

  2. A minimal support threshold function is proposed to identify different support thresholds for patterns of varying lengths, so that long sensitive patterns cannot be easily identified, which secures the privacy of the patients.

  3. A pre-large concept is applied within the designed GA-based model, thus reducing the computational cost of evaluation.

  4. Experiments show that the designed models outperform the generic EC-based model with a uniform threshold value for data sanitization in terms of the three side effects.
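
To make the idea of a length-dependent support threshold concrete, here is a minimal Python sketch. The exact threshold function of the paper is not given in this excerpt, so the Gaussian-shaped decay centred at length 1, the parameters `base` and `sigma`, and the function names are all hypothetical. The sketch also reads a "tighter" threshold as a lower support bound, meaning that a longer sensitive pattern must be pushed to a lower support before it counts as hidden.

```python
import math


def support_threshold(length, base=0.3, sigma=2.0):
    """Hypothetical minimum support threshold for a pattern of `length` items.

    Assumption: a Gaussian-shaped decay with its peak at length 1, so the
    threshold equals `base` for single items and shrinks for longer patterns.
    """
    if length < 1:
        raise ValueError("pattern length must be at least 1")
    return base * math.exp(-((length - 1) ** 2) / (2 * sigma ** 2))


def is_hidden(pattern, transactions):
    """True if the relative support of `pattern` is below its own threshold."""
    if not transactions:
        return True
    pattern = set(pattern)
    count = sum(pattern <= set(t) for t in transactions)
    return count / len(transactions) < support_threshold(len(pattern))
```

With these default parameters, a three-item pattern with relative support 0.25 would count as hidden under a uniform threshold of 0.3, but not under the length-dependent threshold of roughly 0.18, so further sanitization would be required.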

Section snippets

Related work

In this section, genetic algorithms and known techniques for PPDM are reviewed and discussed in turn.

Preliminaries and problem statement

In this section, the preliminaries and the definition of the problem of hiding sensitive information in an identifiable health dataset are introduced.

Proposed GA-based sanitization models

In this section, the proposed GA-based framework for efficiently hiding sensitive patterns by record deletion in a dataset is described. The two models are extended from the previous sGA2DT and pGA2DT algorithms, respectively, with the proposed multi-threshold concept, and are applied to identifiable health datasets.
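
The general shape of GA-based sanitization by record deletion can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the MGA2DR or pMGA2DR algorithm: the chromosome encoding (a fixed-size set of record indices to delete), the restriction of deletion candidates to records that contain a sensitive pattern, the simplified fitness (exposed sensitive patterns plus lost patterns from a given `keep` list, with artificial cost ignored), and all names and parameters are hypothetical.

```python
import random


def support(pattern, transactions):
    # Relative support; an empty database supports nothing.
    if not transactions:
        return 0.0
    return sum(pattern <= t for t in transactions) / len(transactions)


def fitness(deleted, transactions, sensitive, keep, threshold):
    """Lower is better: exposed sensitive patterns plus lost `keep` patterns."""
    remaining = [t for i, t in enumerate(transactions) if i not in deleted]
    exposed = sum(support(p, remaining) >= threshold(len(p)) for p in sensitive)
    lost = sum(support(p, remaining) < threshold(len(p)) for p in keep)
    return exposed + lost


def ga_sanitize(transactions, sensitive, keep, threshold, n_delete,
                pop_size=20, generations=40, seed=0):
    """Search for `n_delete` records whose deletion minimizes `fitness`."""
    rng = random.Random(seed)
    # Assumption: only records containing a sensitive pattern may be deleted.
    candidates = [i for i, t in enumerate(transactions)
                  if any(p <= t for p in sensitive)]
    n_delete = min(n_delete, len(candidates))

    def score(chromosome):
        return fitness(chromosome, transactions, sensitive, keep, threshold)

    population = [frozenset(rng.sample(candidates, n_delete))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score)
        parents = population[: pop_size // 2]      # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(a | b)                   # crossover: genes of both parents
            child = set(rng.sample(pool, n_delete))
            unused = [i for i in candidates if i not in child]
            if child and unused and rng.random() < 0.2:
                child.remove(rng.choice(sorted(child)))   # mutation: swap one gene
                child.add(rng.choice(unused))
            children.append(frozenset(child))
        population = parents + children
    best = min(population, key=score)
    return best, score(best)
```

The `threshold` argument is a function of pattern length, so the same search can be run with a uniform threshold (for example `lambda k: 0.3`) or with a length-dependent one.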

Experimental results

The developed GA-based algorithms were compared in extensive experiments with the previous sGA2DT and pGA2DT sanitization methods [27]. A normal distribution function is used to obtain the minimum support threshold automatically (the mean value is set to 1). The algorithms were implemented in Java, and the experiments were run on a MacBook with an Intel i5 2.7 GHz processor and 16 GB of RAM under macOS Mojave. There are two open identifiable health databases from UCI (University of

Conclusion and future work

In the past, numerous studies have used evolutionary computation (EC) methods to hide sensitive information. However, those EC-based models use only a uniform threshold value for data sanitization and therefore do not suitably hide confidential patterns with longer lengths (i.e., more attributes). This paper presents two GA-based algorithms, called MGA2DR and pMGA2DR, respectively, to hide sensitive information with a varied threshold function, which can be

CRediT authorship contribution statement

Jimmy Ming-Tai Wu: Conceptualization, Writing - original draft. Gautam Srivastava: Funding acquisition, Data curation, Writing - original draft. Alireza Jolfaei: Formal analysis, Proof reading. Philippe Fournier-Viger: Writing - original draft. Jerry Chun-Wei Lin: Investigation, Writing - original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research was partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant program (RGPIN-2020-05363) held by Dr. G. Srivastava.

References (49)

  • S. Liu, X. Liu, S. Wang, K. Muhammad, Fuzzy-aided solution for out-of-view challenge in visual tracking under...
  • Liu S. et al., A robust parallel object tracking method for illumination variations, Mob. Netw. Appl. (2019)
  • J.C.W. Lin, Y. Shao, Y. Djenouri, U. Yun, Asrnn: A recurrent neural network with an attention model for sequence...
  • Gan W. et al., Data mining in distributed environment: a survey, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. (2017)
  • Gan W. et al., A survey of incremental high-utility itemset mining, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. (2018)
  • T.Y. Wu, J.C.W. Lin, U. Yun, C.a. Chen, G. Srivastava, X. Lv, An efficient algorithm for fuzzy frequent itemset mining,...
  • Wu J.M.T. et al., High-utility itemset mining with effective pruning strategies, ACM Trans. Knowl. Discov. Data (2019)
  • G. Srivastava, J.C.W. Lin, M. Pirouz, Y. Li, U. Yun, A pre-large weighted-fusion system of sensed high-utility...
  • Agrawal R. et al., Fast algorithms for mining association rules
  • Han J. et al., Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Min. Knowl. Discov. (2004)
  • Gan W. et al., A survey of parallel sequential pattern mining, ACM Trans. Knowl. Discov. Data (2019)
  • Gan W. et al., A survey of utility-oriented pattern mining, IEEE Trans. Knowl. Data Eng. (2019)
  • Agrawal R. et al., Privacy-preserving data mining
  • Lindell Y. et al., Privacy preserving data mining

    Jimmy Ming-Tai Wu received the Ph.D. degree in computer science and engineering from National Sun Yat-sen University, Kaohsiung, Taiwan. He was a Research Scholar with the Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, and with the Department of Computer Science, College of Engineering, University of Nevada at Las Vegas, and an Assistant Professor with the Harbin Institute of Technology at Shenzhen, China. He is currently an Assistant Professor with the College of Computer Science and Engineering, Shandong University of Science and Technology, China. He worked in an IC design company in Taiwan as a firmware developer and an information technology manager for two years. His current research interests include big data, cloud computing, and the Internet of Things (IoT).

    Gautam Srivastava was awarded his B.Sc. degree from Briar Cliff University in the U.S.A. in 2004, followed by his M.Sc. and Ph.D. degrees from the University of Victoria in Victoria, British Columbia, Canada in 2006 and 2012, respectively. He then taught for 3 years at the University of Victoria in the Department of Computer Science, where he was regarded as one of the top undergraduate professors in Computer Science course instruction at the University. In 2014, he joined a tenure-track position at Brandon University in Brandon, Manitoba, Canada, where he is currently active in various professional and scholarly activities. He was promoted to the rank of Associate Professor in January 2018. Dr. G, as he is popularly known, is active in research in the fields of Cryptography, Data Mining, Security and Privacy, and Blockchain Technology. In his 5 years as a research academic, he has published a total of 170 papers in high-impact conferences in many countries and in high-status journals (SCI, SCIE) and has also delivered invited guest lectures on Big Data, Cloud Computing, Internet of Things, and Cryptography at many universities worldwide. He is an Editor of several SCI/SCIE journals. He is an IEEE Senior Member and also an Associate Editor of the world-renowned IEEE Access journal.

    Alireza Jolfaei received the Ph.D. degree in Applied Cryptography from Griffith University, Gold Coast, Australia. He is a Lecturer (Assistant Professor in North America) and a Program Leader of Cyber Security at Macquarie University, Sydney, Australia. Before this appointment, he worked as an Assistant Professor at Federation University Australia and Temple University in Philadelphia, USA. His current research areas include Cyber Security, IoT Security, Human-in-the-Loop CPS Security, Cryptography, AI and Machine Learning for Cyber Security. He has authored over 70 peer-reviewed articles on topics related to cybersecurity. He has received multiple awards for Academic Excellence, University Contribution, and Inclusion and Diversity Support. He received the prestigious IEEE Australian council award for his research paper published in the IEEE Transactions on Information Forensics and Security. He received a recognition diploma with a cash award from the IEEE Industrial Electronics Society for his publication at the 2019 IEEE IES International Conference on Industrial Technology. He is a founding chair of the Federation University IEEE Student Branch.

    Philippe Fournier-Viger, Ph.D. is a Full Professor. His interests are data mining, algorithm design, pattern mining, sequence mining, big data, and applications. He is the founder of the popular SPMF data mining library, offering more than 130 algorithms, cited in more than 700 research papers since 2010. He has also participated in more than 230 research papers, which have received more than 3600 citations (as of 2019/06). http://www.philippe-fournier-viger.com.

    Jerry Chun-Wei Lin received his Ph.D. from the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan in 2010. He worked as an Assistant Professor from 2012 to 2016 at Harbin Institute of Technology (Shenzhen), China, and as an Associate Professor (Ph.D. supervisor) from 2016 to 2018. He joined the Department of Computer Science, Electrical Engineering and Mathematical Sciences, Western Norway University of Applied Sciences, Bergen, Norway in 2018 as a tenured Associate Professor, and was promoted to tenured full Professor in 2020. He has published more than 300 research articles in refereed journals (IEEE TKDE, IEEE TCYB, IEEE SysJ, IEEE SensJ, ACM TKDD, ACM TDS, ACM TMIS) and international conferences (IEEE ICDE, IEEE ICDM, DASFAA, PKDD, PAKDD). His research interests include data mining, soft computing, artificial intelligence and machine learning, and privacy-preserving and security technologies. He is also the project co-leader of the well-known SPMF library and the founder and project leader of the PPSF library. He is the Editor-in-Chief of the International Journal of Data Science and Pattern Recognition, and Associate/Guest Editor of IEEE Access, JIT, PLOS ONE, International Journal of Interactive Multimedia and Artificial Intelligence, IEEE TFS, IEEE TII, ACM TMIS, Applied Sciences, and Sensors. He is a Fellow of the IET (FIET) and a senior member of both IEEE and ACM.
