MICF: An effective sanitization algorithm for hiding sensitive patterns on data mining

https://doi.org/10.1016/j.aei.2006.12.003

Abstract

Data mining mechanisms have been widely applied in various businesses and manufacturing companies across many industry sectors. Sharing data or sharing mined rules has become a trend among business partnerships, as it is perceived to be a mutually beneficial way of increasing productivity for all parties involved. Nevertheless, this has also increased the risk of unexpected information leaks when releasing data. To conceal restrictive itemsets (patterns) contained in the source database, a sanitization process transforms the source database into a released database from which the counterpart cannot extract sensitive rules. The transformed result, however, also conceals some non-restrictive information, an unwanted side effect called the “misses cost”. The problem of finding an optimal sanitization method, which conceals all restrictive itemsets while minimizing the misses cost, is NP-hard. To address this challenging problem, this study proposes the maximum item conflict first (MICF) algorithm. Experimental results demonstrate that the proposed method is effective, has a low sanitization rate, and generally achieves a significantly lower misses cost than the MinFIA, MaxFIA, IGA, and Algo2b methods on several real and artificial datasets.

Introduction

Data mining is an interdisciplinary field bringing together techniques to extract information from large databases [1]. Over the past few years, data mining mechanisms have been widely applied in various business organizations (for example, retail, insurance, finance, banking, and communication) and manufacturing companies such as Texas Instruments (fault diagnosis), Ford (harshness, noise, and vibration analysis), Motorola (CDMA base station placement), Boeing (post-flight diagnostics), and Kodak (data visualization) [2].

Large companies use powerful data acquisition systems (such as minicomputers, microprocessors, and sensor networks) to collect, analyze, and transfer data. The role of knowledge discovery in databases (KDD) and data mining methodologies has therefore become extremely important for extracting useful knowledge from huge amounts of raw data [2], [3].

Design procedure capture refers to specific engineering knowledge that can be extracted from the design events monitored during the design process [4]; data mining techniques can be applied at different stages of production. Manufacturing, such as wafer fabrication, is typically a controlled process. Using mining tools to analyze the collected data can yield efficient strategies to improve the production process, identify unusual steps during the manufacturing process [5], and improve the reliability of system identification [6].

Data mining is an evolutionary step along the path of problem solving through data analysis [1]. Releasing collected data or mined rules for sharing has become a crucial trend among business partnerships, as it increases productivity for all companies involved. Nevertheless, released data also increase the risk of sensitive information leaks [7]. Organizations should evaluate and reduce the risk of disclosing information. Developing efficient approaches that maintain an organization’s competitive edge by restricting information leakage has therefore become an important issue. Consider the following two scenarios:

  • Scenario one:

    Some paper manufacturers have their own databases that record their stock and sales patterns. For their mutual benefit, multiple companies decide to share their databases to cooperatively extract information and trends from the combined database. However, each company prefers to keep its own strategic patterns confidential from the others. Thus, a company can uncover more interesting trends from the combined shared database than from its own database alone, while preventing sensitive information from being discovered by other companies [8].

  • Scenario two:

    The captured design procedure knowledge helps companies understand how experienced designers carry out their designs and guides other designers toward better designs. Moreover, the knowledge can be used to train novice designers so that they quickly learn how to produce prominent designs [4]. To preserve strategic or sensitive intelligence while still sharing such knowledge among allied companies, a data sanitization process or privacy-preserving techniques must be applied.

An intelligent system can be developed to achieve this, and privacy-preserving data mining will be one of the key techniques in such a system. These mechanisms usually transform the source database into a new one from which sensitive information cannot be extracted. The procedure of transforming the source database into a new database that hides some sensitive rules is called the sanitization process [9].

In association rule mining, all rules are derived from the frequent itemsets (patterns). Accordingly, one essential and efficient way to protect restrictive patterns is to decrease their support values, which can be done by deleting or modifying items in several transactions. Such approaches usually follow two restrictions: (1) reduce the impact on the source database; and (2) strike an appropriate balance between privacy and knowledge discovery. An itemset that must be concealed is called a restrictive itemset. The impact on a source database of deleting items from transactions can be measured by the sanitization rate, defined as the ratio of deleted items to the total support of the restrictive itemsets in the source dataset. The sanitization process can also conceal some non-restrictive itemsets, a side effect called the “misses cost”. Finding an optimal sanitization process, which conceals all restrictive patterns while minimizing the misses cost, is an NP-hard problem [7].
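These two measures can be illustrated with a toy sketch (our own notation and function names, not the authors’ implementation):

```python
# Toy illustration of the sanitization rate and the misses cost; the
# function names and the data are hypothetical, not the paper's code.
def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def sanitization_rate(num_deleted_items, restrictive, db):
    """Ratio of deleted items to the total support of restrictive itemsets."""
    total = sum(support_count(r, db) for r in restrictive)
    return num_deleted_items / total

def misses_cost(frequent_before, frequent_after, restrictive):
    """Fraction of non-restrictive frequent itemsets hidden as a side effect."""
    non_restrictive = frequent_before - restrictive
    hidden = {x for x in non_restrictive if x not in frequent_after}
    return len(hidden) / len(non_restrictive)

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'}]
restrictive = {frozenset({'a', 'b'})}
print(sanitization_rate(1, restrictive, db))  # 1 deletion / total support 2 = 0.5
```

A perfect sanitization would drive the misses cost to 0 while still hiding every restrictive itemset; the NP-hardness result says the two goals cannot, in general, be optimized together efficiently.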

The item grouping algorithm (IGA) has been proposed to enforce privacy in mining frequent itemsets [10]. IGA groups restrictive itemsets and assigns a victim item to each group. This approach can reduce the impact on the database if the sanitized transaction contains more than one restrictive itemset.

IGA has a low misses cost, but can be improved further by reducing the number of deleted items. Moreover, it must handle overlaps between groups. To address this problem, this study proposes a novel algorithm called maximum item conflict first (MICF). The degree of conflict of an item in a sensitive transaction is defined as the number of restrictive patterns affected when the item is deleted. MICF selects the item with the maximum degree of conflict for deletion, thereby simultaneously decreasing the support values of the maximum number of restrictive patterns and reducing the number of items removed from the source database.
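The selection rule can be sketched as follows (a minimal sketch under our own naming; this is not the authors’ code):

```python
# Sketch of MICF's selection rule: in a sensitive transaction, delete the
# item contained in the largest number of restrictive patterns that the
# transaction supports. Names are ours, not the paper's.
def conflict_degree(item, transaction, restrictive):
    """Number of restrictive patterns in `transaction` that contain `item`."""
    return sum(1 for r in restrictive if item in r and r <= transaction)

def pick_victim(transaction, restrictive):
    """Item whose deletion decreases the support of the most patterns."""
    candidates = {i for r in restrictive if r <= transaction for i in r}
    return max(candidates, key=lambda i: conflict_degree(i, transaction, restrictive))

t = {'a', 'b', 'c'}
patterns = [frozenset({'a', 'b'}), frozenset({'b', 'c'})]
print(pick_victim(t, patterns))  # 'b' hits both patterns
```

Deleting 'b' here lowers the support of both patterns at once, which is exactly the intuition behind preferring the maximum-conflict item.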

This study focuses on the task of deleting items from transactions to conceal frequent itemsets in association rule mining, while guaranteeing no hiding failure. All association rules derivable from these hidden frequent itemsets are thus also hidden. Moreover, since items are only deleted, no extra artificial itemset can be generated from the sanitized dataset.

The rest of this paper is organized as follows. Section 2 presents an overview of current methods for solving the problem of privacy-preserving association rule mining. Section 3 explains the proposed maximum item conflict first (MICF) algorithm for hiding all restrictive itemsets. The time complexity analysis is presented in Section 4. Section 5 provides experimental results and evaluates the performance of the proposed algorithm. Finally, Section 6 concludes with a summary of this work.


Mining association rules

The problem of mining association rules was first presented in 1993 [11], and has since become one of the most important tasks in data mining. Given a transaction database, association rule mining attempts to discover significant relationships among items. The formal definition is as follows.

Let DB denote a transaction database, which is a set of transactions: DB = {T1, T2, …, Tz}. Let I = {i1, i2, …, in} be the set of all items in DB. Thus, ∀Tq ∈ DB, Tq ⊆ I, 1 ≤ q ≤ z. Let X be a set
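A minimal numeric illustration of these definitions (toy data of our own, not from the paper):

```python
# Toy illustration of the definitions above: support(X) is the fraction of
# transactions in DB that contain X; X is frequent when its support meets
# the minimum support threshold psi.
def support(X, DB):
    return sum(1 for T in DB if X <= T) / len(DB)

DB = [{'bread', 'milk'}, {'bread', 'butter'}, {'bread', 'milk', 'butter'}]
psi = 0.5
X = {'bread', 'milk'}
print(support(X, DB), support(X, DB) >= psi)  # 2/3 of transactions; frequent
```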

Maximum item conflict first (MICF) method

Given the restrictive itemsets, frequent itemsets, and source database, the goal of the sanitization process is to protect restrictive frequent itemsets against the mining techniques used to discover them. The sanitization process, which essentially decreases the support values of restrictive itemsets by removing items from sensitive transactions, comprises four sub-problems:

  • (1)

    Identifying the set of sensitive transactions for each restrictive itemset.

  • (2)

    Selecting the partial sensitive transactions to
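Sub-problem (1) admits a direct sketch (our own toy code with hypothetical names, not the paper’s implementation):

```python
# Sketch of sub-problem (1): a transaction is sensitive with respect to a
# restrictive itemset when it contains every item of that itemset.
# Function and variable names are hypothetical.
def sensitive_transactions(restrictive_itemset, db):
    return [idx for idx, t in enumerate(db) if restrictive_itemset <= t]

db = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'b'}]
print(sensitive_transactions({'a', 'b'}, db))  # [0, 2]
```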

Sanitization rate

Definition 4.1

If removing an item from a transaction reduces the support count of some restrictive itemsets by one, and the current support count of each such restrictive itemset rj is greater than |dbrj| × ψ, the removal is called a valid item-sanitization; otherwise, it is called an invalid item-sanitization. Clearly, the support count of rj decreases by one after the removal.
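Definition 4.1 reduces to a one-line predicate (a hedged sketch; the threshold term |dbrj| × ψ is taken from the definition above, and all names are ours):

```python
# Sketch of Definition 4.1: a removal is a valid item-sanitization only
# while the restrictive itemset's current support count is still above the
# threshold |dbrj| * psi. Names are hypothetical, not the paper's.
def is_valid_item_sanitization(current_support_count, db_rj_size, psi):
    return current_support_count > db_rj_size * psi

print(is_valid_item_sanitization(5, 10, 0.4))  # True: 5 > 4
print(is_valid_item_sanitization(4, 10, 0.4))  # False: 4 is not > 4
```

Once the support count drops to the threshold, the itemset is already hidden, so any further removals would delete items without contributing to the hiding goal.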

Theorem 4.1

Let a sanitization process have no invalid item-sanitization. Let rj be an arbitrary restrictive itemset in RI, where 1 ≤ j ≤

Experimental results

To measure the effectiveness of MICF, experiments were conducted on both simulated and real datasets to compare its misses cost and sanitization rate with those of Algo2b, MaxFIA, MinFIA, and IGA. All experiments were performed on an AMD Barton ES 2900+ (2000 MHz) PC with 1 GB of main memory, running Windows XP Professional. All algorithms were coded in Visual C++ 6.0 and employed the same array structure to store all restrictive transactions in main memory.

Given a minimum support threshold

Conclusions

Data mining techniques can discover useful information from databases. Accurate input data lead to meaningful mining results, but problems arise when users provide fictitious data to protect their privacy. In this competitive yet cooperative business environment, companies need to share information with others while protecting their own confidential strategies. To make this kind of data sharing possible, this study proposes the maximum item conflict first (MICF)

Acknowledgement

The authors would like to acknowledge the helpful comments made by the anonymous reviewers of this paper and thank Blue Martini Software, Inc., for kindly providing the BMS datasets.

References (28)

  • Y. Ishino et al.

    An information value based approach to design procedure capture

    Advanced Engineering Informatics

    (2006)
  • S. Saitta et al.

    Data mining techniques for improving the reliability of system identification

    Advanced Engineering Informatics

    (2005)
  • P. Cabena et al.

    Discovering Data Mining from Concept to Implementation

    (1998)
  • M. Kantardzic

    Data Mining: Concepts, Models, Methods, and Algorithms

    (2002)
  • J. Soenjaya et al.

    Mining wafer fabrication: framework and challenges

  • M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, V. Verykios, Disclosure limitation of sensitive rules, in:...
  • M. Kantarcioglu et al.

    Privacy-preserving distributed mining of association rules on horizontally partitioned data

    IEEE Transactions on Knowledge and Data Engineering

    (2004)
  • A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke, Privacy preserving mining of association rules, in: Proceedings of...
  • S.R.M. Oliveira, O.R. Zaïane, Privacy preserving frequent itemset mining, in: Proceedings of IEEE ICDM Workshop on...
  • R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings...
  • R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of 20th International Conference...
  • J.S. Park, M.S. Chen, P.S. Yu, An effective hash-based algorithm for mining association rules, in: Proceedings of 1995...
  • S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in:...