MICF: An effective sanitization algorithm for hiding sensitive patterns on data mining

https://doi.org/10.1016/j.aei.2006.12.003

Abstract

Data mining mechanisms have been widely applied in various businesses and manufacturing companies across many industry sectors. Sharing data or sharing mined rules has become a trend among business partnerships, as it is perceived to be a mutually beneficial way of increasing productivity for all parties involved. Nevertheless, this has also increased the risk of unexpected information leaks when releasing data. To conceal restrictive itemsets (patterns) contained in the source database, a sanitization process transforms the source database into a released database from which the counterpart cannot extract sensitive rules. The transformed result, however, also conceals some non-restrictive information, an unwanted side effect called the “misses cost”. The problem of finding an optimal sanitization method, which conceals all restrictive itemsets while minimizing the misses cost, is NP-hard. To address this challenging problem, this study proposes the maximum item conflict first (MICF) algorithm. Experimental results demonstrate that the proposed method is effective, has a low sanitization rate, and generally achieves a significantly lower misses cost than the MinFIA, MaxFIA, IGA, and Algo2b methods on several real and artificial datasets.

Introduction

Data mining is an interdisciplinary field bringing together techniques to extract information from large databases [1]. Over the past few years, data mining mechanisms have been widely applied in various business organizations (for example, retail, insurance, finance, banking, and communication) and manufacturing companies such as Texas Instruments (fault diagnosis), Ford (harshness, noise, and vibration analysis), Motorola (CDMA base station placement), Boeing (post-flight diagnostics), and Kodak (data visualization) [2].

Large companies use powerful data acquisition systems (such as minicomputers, microprocessors, and sensor networks) to collect, analyze, and transfer data. The role of knowledge discovery in databases (KDD) and data mining methodologies has therefore become extremely important for extracting useful knowledge from huge amounts of raw data [2], [3].

Design procedure capture refers to specific engineering knowledge that can be extracted from the design events monitored during the design process [4]; data mining techniques can be applied at different stages of production. Manufacturing, such as wafer fabrication, is typically a controlled process. Using mining tools to analyze the collected data can yield efficient strategies to improve the production process, identify unusual steps during the manufacturing process [5], and improve the reliability of system identification [6].

Data mining is an evolutionary step along the path of problem solving through data analysis [1]. Releasing collected data or mined rules for sharing has become a crucial trend among business partnerships, as it increases productivity for all companies involved. Nevertheless, released data also increase the risk of sensitive information leaks [7]. Organizations should evaluate and reduce the risk of disclosing information. Developing efficient approaches that maintain an organization’s competitive edge by restricting information leakage has therefore become an important issue. Consider the following two scenarios:

  • Scenario one:

    Some paper manufacturers have their own databases that record their stock and sales patterns. For their mutual benefit, multiple companies decide to share their databases to cooperatively extract information and trends from the combined database. However, each company prefers to keep its own strategic patterns confidential from the others. Thus, a company can uncover more interesting trends from the combined shared database than from its own database alone, while preventing sensitive information from being discovered by other companies [8].

  • Scenario two:

    The captured design procedure knowledge helps companies understand how experienced designers carry out their designs and guides other designers toward better designs. Moreover, the knowledge can be used to train novice designers so that they quickly learn how to produce prominent designs [4]. To preserve strategic or sensitive intelligence while still sharing such knowledge among allied companies, a data sanitization process or privacy-preserving techniques must be applied.

An intelligent system can be developed to achieve this, and privacy-preserving data mining will be one of the key techniques in such a system. These mechanisms usually transform the source database into a new one from which sensitive information cannot be extracted. The procedure of transforming the source database into a new database that hides some sensitive rules is called the sanitization process [9].

In association rule mining, all rules are derived from the frequent itemsets (patterns). Accordingly, one essential and efficient way to protect restrictive patterns is to decrease their support values, which can be done by deleting or modifying items in several transactions. Such approaches usually follow two restrictions: (1) reduce the impact on the source database; and (2) strike an appropriate balance between privacy and knowledge discovery. An itemset that must be concealed is called a restrictive itemset. The impact on a source database of deleting items from transactions can be measured by the sanitization rate, defined as the ratio of deleted items to the total support of the restrictive itemsets in the source dataset. The sanitization process can also conceal some non-restrictive itemsets, a side effect called the “misses cost”. Finding an optimal sanitization process, which conceals all restrictive patterns while minimizing the misses cost, is an NP-hard problem [7].
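These two measures can be illustrated with a toy sketch (our own notation and function names, not the authors’ implementation):

```python
# Toy illustration of the sanitization rate and the misses cost; the
# function names and the data are hypothetical, not the paper's code.
def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def sanitization_rate(num_deleted_items, restrictive, db):
    """Ratio of deleted items to the total support of restrictive itemsets."""
    total = sum(support_count(r, db) for r in restrictive)
    return num_deleted_items / total

def misses_cost(frequent_before, frequent_after, restrictive):
    """Fraction of non-restrictive frequent itemsets hidden as a side effect."""
    non_restrictive = frequent_before - restrictive
    hidden = {x for x in non_restrictive if x not in frequent_after}
    return len(hidden) / len(non_restrictive)

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'}]
restrictive = {frozenset({'a', 'b'})}
print(sanitization_rate(1, restrictive, db))  # 1 deletion / total support 2 = 0.5
```

A perfect sanitization would drive the misses cost to 0 while still hiding every restrictive itemset; the NP-hardness result says the two goals cannot, in general, be optimized together efficiently.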

The item grouping algorithm (IGA) has been proposed to enforce privacy in mining frequent itemsets [10]. IGA groups restrictive itemsets and assigns a victim item to each group. This approach can reduce the impact on the database if the sanitized transaction contains more than one restrictive itemset.

IGA has a low misses cost, but can be improved further by reducing the number of deleted items. Moreover, it must handle overlaps between groups. To address this problem, this study proposes a novel algorithm called maximum item conflict first (MICF). The degree of conflict of an item in a sensitive transaction is defined as the number of restrictive patterns affected when the item is deleted. MICF selects the item with the maximum degree of conflict for deletion, thereby simultaneously decreasing the support values of the maximum number of restrictive patterns and reducing the number of items removed from the source database.
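The selection rule can be sketched as follows (a minimal sketch under our own naming; this is not the authors’ code):

```python
# Sketch of MICF's selection rule: in a sensitive transaction, delete the
# item contained in the largest number of restrictive patterns that the
# transaction supports. Names are ours, not the paper's.
def conflict_degree(item, transaction, restrictive):
    """Number of restrictive patterns in `transaction` that contain `item`."""
    return sum(1 for r in restrictive if item in r and r <= transaction)

def pick_victim(transaction, restrictive):
    """Item whose deletion decreases the support of the most patterns."""
    candidates = {i for r in restrictive if r <= transaction for i in r}
    return max(candidates, key=lambda i: conflict_degree(i, transaction, restrictive))

t = {'a', 'b', 'c'}
patterns = [frozenset({'a', 'b'}), frozenset({'b', 'c'})]
print(pick_victim(t, patterns))  # 'b' hits both patterns
```

Deleting 'b' here lowers the support of both patterns at once, which is exactly the intuition behind preferring the maximum-conflict item.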

This study focuses on the task of deleting items from transactions to conceal frequent itemsets in association rule mining, while guaranteeing no hiding failure. All association rules derivable from these hidden frequent itemsets are thus also hidden. Moreover, since items are only deleted, no extra artificial itemset can be generated from the sanitized dataset.

The rest of this paper is organized as follows. Section 2 presents an overview of current methods for solving the problem of privacy-preserving association rule mining. Section 3 explains the proposed maximum item conflict first (MICF) algorithm for hiding all restrictive itemsets. The time complexity analysis is presented in Section 4. Section 5 provides experimental results and evaluates the performance of the proposed algorithm. Finally, Section 6 concludes with a summary of this work.


Mining association rules

The problem of mining association rules was first presented in 1993 [11], and has since become one of the most important tasks in data mining. Given a transaction database, association rule mining attempts to discover significant relationships among items. The formal definition is as follows.

Let DB denote a transaction database, which is a set of transactions: DB = {T1, T2, …, Tz}. Let I = {i1, i2, …, in} be the set of all items in DB. Thus, ∀Tq ∈ DB, Tq ⊆ I, 1 ≤ q ≤ z. Let X be a set
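A minimal numeric illustration of these definitions (toy data of our own, not from the paper):

```python
# Toy illustration of the definitions above: support(X) is the fraction of
# transactions in DB that contain X; X is frequent when its support meets
# the minimum support threshold psi.
def support(X, DB):
    return sum(1 for T in DB if X <= T) / len(DB)

DB = [{'bread', 'milk'}, {'bread', 'butter'}, {'bread', 'milk', 'butter'}]
psi = 0.5
X = {'bread', 'milk'}
print(support(X, DB), support(X, DB) >= psi)  # 2/3 of transactions; frequent
```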

Maximum item conflict first (MICF) method

Given the restrictive itemsets, frequent itemsets, and source database, the goal of the sanitization process is to protect restrictive frequent itemsets against the mining techniques used to discover them. The sanitization process, which essentially decreases the support values of restrictive itemsets by removing items from sensitive transactions, comprises four sub-problems:

  • (1)

    Identifying the set of sensitive transactions for each restrictive itemset.

  • (2)

    Selecting the partial sensitive transactions to
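Sub-problem (1) admits a direct sketch (our own toy code with hypothetical names, not the paper’s implementation):

```python
# Sketch of sub-problem (1): a transaction is sensitive with respect to a
# restrictive itemset when it contains every item of that itemset.
# Function and variable names are hypothetical.
def sensitive_transactions(restrictive_itemset, db):
    return [idx for idx, t in enumerate(db) if restrictive_itemset <= t]

db = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'b'}]
print(sensitive_transactions({'a', 'b'}, db))  # [0, 2]
```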

Sanitization rate

Definition 4.1

If removing an item from a transaction reduces the support count of some restrictive itemsets by one, and the current support count of each such restrictive itemset rj is greater than |dbrj| × ψ, the removal is called a valid item-sanitization; otherwise, it is called an invalid item-sanitization. Clearly, the support count of rj decreases by one after the removal.
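Definition 4.1 reduces to a one-line predicate (a hedged sketch; the threshold term |dbrj| × ψ is taken from the definition above, and all names are ours):

```python
# Sketch of Definition 4.1: a removal is a valid item-sanitization only
# while the restrictive itemset's current support count is still above the
# threshold |dbrj| * psi. Names are hypothetical, not the paper's.
def is_valid_item_sanitization(current_support_count, db_rj_size, psi):
    return current_support_count > db_rj_size * psi

print(is_valid_item_sanitization(5, 10, 0.4))  # True: 5 > 4
print(is_valid_item_sanitization(4, 10, 0.4))  # False: 4 is not > 4
```

Once the support count drops to the threshold, the itemset is already hidden, so any further removals would delete items without contributing to the hiding goal.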

Theorem 4.1

Let a sanitization process have no invalid item-sanitization. Let rj be an arbitrary restrictive itemset in RI, where 1 ≤ j ≤

Experimental results

To measure the effectiveness of MICF, experiments were conducted on both simulated and real datasets to compare its misses cost and sanitization rate with those of Algo2b, MaxFIA, MinFIA, and IGA. All experiments were performed on an AMD Barton ES 2900+ (2000 MHz) PC with 1 GB of main memory, running Windows XP Professional. All algorithms were coded in Visual C++ 6.0 and employed the same array structure to store all restrictive transactions in main memory.

Given a minimum support threshold

Conclusions

Data mining techniques can discover useful information from databases. Accurate input data lead to meaningful mining results, but problems arise when users provide fictitious data to protect their privacy. In this competitive yet cooperative business environment, companies need to share information with others while protecting their own confidential strategies. To make this kind of data sharing possible, this study proposes the maximum item conflict first (MICF)

Acknowledgement

The authors would like to acknowledge the helpful comments made by the anonymous reviewers of this paper and thank Blue Martini Software, Inc., for kindly providing the BMS datasets.

References (28)

  • Y. Ishino et al.

    An information value based approach to design procedure capture

    Advanced Engineering Informatics

    (2006)
  • S. Saitta et al.

    Data mining techniques for improving the reliability of system identification

    Advanced Engineering Informatics

    (2005)
  • P. Cabena et al.

    Discovering Data Mining from Concept to Implementation

    (1998)
  • M. Kantardzic

    Data Mining: Concepts, Models, Methods, and Algorithms

    (2002)
  • J. Soenjaya et al.

    Mining wafer fabrication: framework and challenges

  • M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, V. Verykios, Disclosure limitation of sensitive rules, in:...
  • M. Kantarcioglu et al.

    Privacy-preserving distributed mining of association rules on horizontally partitioned data

    IEEE Transactions on Knowledge and Data Engineering

    (2004)
  • A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke, Privacy preserving mining of association rules, in: Proceedings of...
  • S.R.M. Oliveira, O.R. Zaïane, Privacy preserving frequent itemset mining, in: Proceedings of IEEE ICDM Workshop on...
  • R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings...
  • R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of 20th International Conference...
  • J.S. Park, M.S. Chen, P.S. Yu, An effective hash-based algorithm for mining association rules, in: Proceedings of 1995...
  • S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in:...