MICF: An effective sanitization algorithm for hiding sensitive patterns on data mining
Introduction
Data mining is an interdisciplinary field bringing together techniques to extract information from large databases [1]. Over the past few years, data mining mechanisms have widely been applied in various business organizations (for example, retail, insurance, finance, banking, and communication) and manufacturing companies such as Texas Instruments (fault diagnosis), Ford (harshness, noise, and vibration analysis), Motorola (CDMA Base Station Placement), Boeing (Post-flight Diagnostics), and Kodak (data visualization) [2].
Large companies use powerful data acquisition systems (such as minicomputers, microprocessors, and sensor networks) to collect, analyze, and transfer data. The role of knowledge discovery in databases (KDD) and data mining methodologies, therefore, has become extremely important for extracting useful knowledge from huge amounts of raw data [2], [3].
Design procedure capture, a form of engineering knowledge extracted from the design events monitored during the design process [4], illustrates how data mining techniques can be applied at different stages of production. Manufacturing is typically a controlled process, such as wafer fabrication. Using mining tools to analyze the collected data can yield efficient strategies to improve the production process, detect unusual steps during manufacturing [5], and improve the reliability of system identification [6].
Data mining is an evolutionary step along the path of problem solving through data analysis [1]. Releasing collected data or mined rules for sharing has become a crucial trend among business partnerships, as it increases productivity for all companies involved. Nevertheless, released data also increase the risk of leaking sensitive information [7]. Organizations should evaluate and reduce the risk of disclosing such information. Developing efficient approaches that maintain an organization's competitive edge by restricting information leakage has therefore become an important issue. Consider the following two scenarios:
- Scenario one:
Some paper manufacturers have their own databases that record their stock and sales patterns. For their mutual benefit, multiple companies decide to share their databases to cooperatively discover information and trends from the combined data. However, each company prefers to keep its own strategic patterns confidential from the others. Thus, a company can both uncover more interesting trends from the shared database than those available from its own database alone, and prevent sensitive information from being discovered by other companies [8].
- Scenario two:
The captured design procedure knowledge helps companies to understand how experienced designers carry out their designs and guide other designers to design better. Moreover, the knowledge can be used for training novice designers so that they can quickly learn how to execute prominent designs [4]. In order to preserve strategic or sensitive intelligence and still share such knowledge among allied companies, a data sanitization process or privacy-preserving techniques must be applied.
An intelligent system can be developed to achieve this, and privacy-preserving data mining will be one of the key techniques in such a system. These mechanisms usually transform the source database into a new one from which sensitive information cannot be extracted. The procedure of transforming the source database into a new database that hides some sensitive rules is called the sanitization process [9].
In association rule mining, all rules are derived from the frequent itemsets (patterns). Accordingly, one essential and efficient way to protect restrictive patterns is to decrease their support values, which can be done by deleting or modifying items in several transactions. Such approaches usually follow two restrictions: (1) reduce the impact on the source database; and (2) find an appropriate balance between privacy and knowledge discovery. An itemset that must be concealed is called a restrictive itemset. The impact on a source database of deleting items from transactions can be measured by the sanitization rate, defined as the ratio of deleted items to the total support values of the restrictive itemsets in the source dataset. The sanitization process can also conceal some non-restrictive itemsets, a side effect called the "misses cost". An optimal sanitization process, which both conceals all restrictive patterns and minimizes the misses cost, is an NP-hard problem [7].
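The sanitization rate defined above can be computed directly. The following is a minimal Python sketch; the itemsets and counts are illustrative values, not data from the paper:

```python
def sanitization_rate(num_deleted_items, restrictive_support_counts):
    # Ratio of deleted items to the total support count of the
    # restrictive itemsets in the source dataset.
    return num_deleted_items / sum(restrictive_support_counts.values())

# Hypothetical example: two restrictive itemsets with support counts
# 10 and 6; the sanitization process deleted 4 items in total.
rate = sanitization_rate(4, {("a", "b"): 10, ("c", "d"): 6})  # 4 / 16 = 0.25
```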
The item grouping algorithm (IGA) has been proposed to enforce privacy in mining frequent itemsets [10]. IGA groups restrictive itemsets and assigns a victim item to each group. This approach can reduce the impact on the database if the sanitized transaction contains more than one restrictive itemset.
IGA has a low misses cost, but can be improved further by reducing the number of deleted items; it must also handle the overlap between groups. To address this problem, this study proposes a novel algorithm called maximum item conflict first (MICF). The degree of conflict of an item in a sensitive transaction is defined as the number of restrictive patterns affected when that item is deleted. MICF selects the item with the maximum degree of conflict for deletion. Therefore, MICF simultaneously decreases the support values of the maximum number of restrictive patterns and reduces the number of items that must be removed from the source database.
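The victim-selection rule described above can be sketched as follows. This is an illustrative Python fragment based on the stated definition of conflict degree, not the paper's implementation (which was coded in Visual C++):

```python
def degree_of_conflict(item, transaction, restrictive_itemsets):
    # Number of restrictive patterns supported by this transaction
    # whose support would drop if `item` were deleted.
    t = set(transaction)
    return sum(1 for r in restrictive_itemsets if item in r and set(r) <= t)

def pick_victim(transaction, restrictive_itemsets):
    # MICF victim selection: delete the item with the maximum conflict degree.
    return max(set(transaction),
               key=lambda i: degree_of_conflict(i, transaction, restrictive_itemsets))

# In transaction {a, b, c, d} with restrictive patterns {a, b} and {b, c},
# deleting "b" lowers the support of both patterns at once.
victim = pick_victim(["a", "b", "c", "d"], [{"a", "b"}, {"b", "c"}])
```

Deleting the maximum-conflict item is what lets MICF decrease the support of several restrictive patterns with a single removal.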
This study focuses on concealing frequent itemsets in association rule mining by deleting items from transactions, with no hiding failure. All association rules derivable from these hidden frequent itemsets are thus also hidden, and no artificial itemset can be generated from the sanitized dataset.
The rest of this paper is organized as follows. Section 2 presents an overview of current methods for privacy-preserving association rule mining. Section 3 explains the proposed maximum item conflict first (MICF) algorithm for hiding all restrictive itemsets. The time complexity analysis is presented in Section 4. Section 5 provides experimental results and evaluates the performance of the proposed algorithm. Finally, Section 6 concludes with a summary of our work.
Mining association rules
The problem of mining association rules was first presented in 1993 [11]. Since then, mining association rules has become one of the most important tasks in data mining. Given a transaction database, mining association rules attempts to discover significant relationships among items. The formal definition is as follows.
Let DB denote a transaction database, which is a set of transactions: DB = {T1, T2, … , Tz}. Let I = {i1, i2, … , in} be the set of all items in DB. Thus, ∀Tq ∈ DB, Tq ⊆ I, 1 ⩽ q ⩽ z. Let X be a set
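The snippet above breaks off, but the standard support-count notion it builds toward can be sketched in Python. The toy database below is illustrative, not from the paper:

```python
def support_count(itemset, db):
    # Number of transactions in DB that contain every item of the itemset.
    x = set(itemset)
    return sum(1 for t in db if x <= set(t))

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
s_ac = support_count({"a", "c"}, db)  # 3 transactions contain both a and c
s_ab = support_count({"a", "b"}, db)  # 2 transactions contain both a and b
```

An itemset is frequent when its support count meets the minimum support threshold; restrictive itemsets are hidden by pushing their counts below that threshold.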
Maximum item conflict first (MICF) method
Given the restrictive itemsets, the frequent itemsets, and the source database, the goal of the sanitization process is to protect restrictive frequent itemsets against the mining techniques used to discover them. The sanitization process, which essentially decreases the support values of restrictive itemsets by removing items from sensitive transactions, includes four sub-problems:
- (1)
Identifying the set of sensitive transactions for each restrictive itemset.
- (2)
Selecting the partial sensitive transactions to
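Sub-problem (1) above admits a direct sketch: a transaction is sensitive for a restrictive itemset when it supports that itemset. The following Python fragment and its toy database are illustrative assumptions, not the paper's code:

```python
def sensitive_transactions(db, restrictive_itemset):
    # Sub-problem (1): identify the transactions (by index) that
    # fully contain, and thus support, the restrictive itemset.
    r = set(restrictive_itemset)
    return [tid for tid, t in enumerate(db) if r <= set(t)]

db = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "d"}]
tids = sensitive_transactions(db, {"a", "b"})  # transactions 0 and 3
```

Only these transactions are candidates for item deletion, since removing items elsewhere cannot change the restrictive itemset's support.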
Sanitization rate
Definition 4.1
If removing an item from a transaction reduces by one the support count of some restrictive itemsets, where the current support count of each such restrictive itemset rj is greater than the minimum support count, the removal is called a valid item-sanitization; otherwise it is called an invalid item-sanitization. Clearly, the support count of rj decreases by one after the removal.

Theorem 4.1
Let a sanitization process have no invalid item-sanitization. Let rj be an arbitrary restrictive itemset in RI, where 1 ⩽ j ⩽
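Definition 4.1 can be expressed as a simple predicate. This Python sketch follows the stated definition; the data values are hypothetical:

```python
def is_valid_item_sanitization(item, transaction, restrictive_itemsets,
                               support_counts, min_support_count):
    # The removal is valid only if it lowers the support count of at least
    # one restrictive itemset whose current count still exceeds the
    # minimum support count (i.e., it still needs to be lowered).
    t = set(transaction)
    return any(item in r and set(r) <= t
               and support_counts[frozenset(r)] > min_support_count
               for r in restrictive_itemsets)

restrictive = [frozenset({"a", "b"})]
counts = {frozenset({"a", "b"}): 3}
ok = is_valid_item_sanitization("a", {"a", "b", "c"}, restrictive, counts, 2)  # valid
no = is_valid_item_sanitization("a", {"a", "b", "c"}, restrictive, counts, 3)  # invalid
```

Invalid item-sanitizations waste deletions on itemsets that are already at or below the threshold, which is why the theorem restricts attention to processes without them.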
Experimental results
To measure the effectiveness of MICF, experiments were conducted on both simulated and real datasets to compare its misses costs and sanitization rates with those of Algo2b, MaxFIA, MinFIA, and IGA. All experiments were performed on an AMD Barton ES 2900+ (2000 MHz) PC with 1 GB of main memory, running Windows XP Professional. All algorithms were coded in Visual C++ 6.0 and employed the same array structure to store all restrictive transactions in main memory.
Given a minimum support threshold
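The misses cost compared in these experiments measures the non-restrictive frequent itemsets accidentally hidden by sanitization. The ratio form below is a common formulation and an assumption on my part, as is the toy data; it is not taken from the paper's experimental setup:

```python
def misses_cost(frequent_before, frequent_after, restrictive):
    # Fraction of non-restrictive frequent itemsets that the
    # sanitization process accidentally hid as a side effect.
    non_restrictive = frequent_before - restrictive
    hidden = non_restrictive - frequent_after
    return len(hidden) / len(non_restrictive)

before = {"A", "B", "C", "D"}   # frequent itemsets mined from the source DB
restrictive = {"A"}             # itemset intentionally hidden
after = {"B", "D"}              # frequent itemsets mined from the sanitized DB
cost = misses_cost(before, after, restrictive)  # "C" was lost: 1 of 3
```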
Conclusions
Data mining techniques can discover useful information from databases. Accurate input data leads to meaningful mining results, but problems arise when users provide fictitious data to protect their privacy. In this competitive yet cooperative business environment, companies need to share information with others while protecting their own confidential strategic knowledge. To make this kind of data sharing possible, this study proposes the maximum item conflict first (MICF) algorithm.
Acknowledgement
The authors would like to acknowledge the helpful comments made by the anonymous reviewers of this paper and thank Blue Martini Software, Inc., for kindly providing the BMS datasets.
References (28)
- et al., An information value based approach to design procedure capture, Advanced Engineering Informatics (2006)
- et al., Data mining techniques for improving the reliability of system identification, Advanced Engineering Informatics (2005)
- et al., Discovering Data Mining from Concept to Implementation (1998)
- Data Mining: Concepts, Models, Methods, and Algorithms (2002)
- et al., Mining wafer fabrication: framework and challenges
- M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, V. Verykios, Disclosure limitation of sensitive rules, in:...
- et al., Privacy-preserving distributed mining of association rules on horizontally partitioned data, IEEE Transactions on Knowledge and Data Engineering (2004)
- A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke, Privacy preserving mining of association rules, in: Proceedings of...
- S.R.M. Oliveira, O.R. Zaïane, Privacy preserving frequent itemset mining, in: Proceedings of IEEE ICDM Workshop on...