Privacy preserving itemset mining through noisy items

https://doi.org/10.1016/j.eswa.2008.06.052

Abstract

This work investigates the problem of privacy-preserving mining of frequent itemsets. We propose a procedure that protects the privacy of data by adding noisy items to each transaction, together with an algorithm that reconstructs frequent itemsets from the noise-added transactions. The experimental results indicate that this method can achieve a rather high level of accuracy. Because the method builds on existing algorithms for frequent itemset mining, it takes full advantage of their progress to mine frequent itemsets efficiently.

Introduction

Owing to the rapid progress in information technologies, companies can nowadays collect and store huge amounts of data. With this large volume of data, manual analysis is no longer feasible. Instead, automatic or semiautomatic tools that employ data mining techniques have been widely used to support data analysis.

In a nutshell, data mining is the process of discovering patterns in data, and the quality of the data is crucial to its success. However, as the public becomes more concerned with privacy, more and more people are unwilling to provide real personal data when asked to do so (Eisenberg, 2002). Moreover, companies that wish to mine their customers’ data cannot easily do so without compromising their customers’ privacy. Agrawal suggested that data mining techniques incorporating privacy concerns be developed to resolve this dilemma (Agrawal & Srikant, 2000). More specifically, Agrawal and Srikant (2000) proposed a privacy-preserving technique that builds a decision-tree classifier without access to precise information in individual data records. Their primary idea is to perturb individual records through simple probabilistic distortion so that the privacy of the data is protected. People can thus rest assured that companies collect only a distorted version of their data. Agrawal and Srikant then describe how to mine the distorted data rather than the original data, so that data privacy and data mining are achieved simultaneously.

Association rule mining (Agrawal, Imielinski, & Swami, 1993) is one of the most widely used data mining techniques. Several privacy-preserving techniques for association rule mining have been proposed in the past few years (Evfimievski et al., 2002; Lin and Liu, 2007; Rizvi and Haritsa, 2002). The approach of Lin and Liu (2007) fabricates many transactions and mixes them with the real transactions, making it difficult to determine which transactions are real and which are fake. This approach, in a sense, adds noise between transactions to protect the privacy of data. The approaches of Evfimievski et al. (2002) and Rizvi and Haritsa (2002) instead add noise within each transaction: some items are added to, and some items are removed from, each transaction. It then becomes difficult, if not impossible, to determine which items were in the original transaction, so the privacy of data is protected.

Because the approaches of Evfimievski et al. (2002) and Rizvi and Haritsa (2002) remove some items from each transaction, the support of an itemset in the randomized transactions may be smaller than that in the original transactions. Consequently, the support of a frequent itemset in the randomized transactions may fall below the minimum support. This property rules out using the original minimum support as a threshold for filtering frequent itemsets, and thus makes it hard for these approaches to utilize existing tools for association rule mining. Furthermore, when reconstructing the support of an itemset, these approaches must know the supports of the subsets of the itemset. This poses little difficulty when modifying traditional level-wise association rule mining algorithms, such as Apriori (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996), but it makes it hard to utilize more efficient algorithms that are not level-wise, such as FP-growth (Han, Pei, & Yin, 2000).

This paper proposes a new approach that randomizes transactions by adding items to each transaction without removing any item. Because items are only added, the support of an itemset in the randomized transactions is never smaller than its support in the original transactions. The approach therefore first uses any off-the-shelf association rule mining tool with the original minimum support to extract all candidate frequent itemsets from the randomized transactions. Then, it reconstructs the support of each candidate level by level to find the frequent itemsets. Thus, this approach can take full advantage of the progress made in association rule mining algorithms to work efficiently.
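To illustrate why the original minimum support can be reused, the following minimal sketch checks that adding items never lowers the support of any itemset. The toy transactions, the `noise` sets, and the `support` helper are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions (sets) that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


# Hypothetical original transactions and hypothetical noisy items per transaction.
original = [{"a", "b"}, {"a", "c"}, {"b"}, {"a", "b", "c"}]
noise = [{"c"}, {"d"}, {"a", "d"}, set()]

# Add-only randomization: each noisy transaction is a superset of its original.
noisy = [t | extra for t, extra in zip(original, noise)]

items = sorted(set().union(*noisy))
all_itemsets = [c for k in range(1, len(items) + 1) for c in combinations(items, k)]

# Support can only stay the same or grow when items are added.
never_lower = all(support(noisy, c) >= support(original, c) for c in all_itemsets)

s_min = 0.5
frequent_original = {c for c in all_itemsets if support(original, c) >= s_min}
frequent_noisy = {c for c in all_itemsets if support(noisy, c) >= s_min}

# Every truly frequent itemset survives as a candidate in the noisy data.
assert never_lower
assert frequent_original <= frequent_noisy
```

In this toy example the noisy data also yields extra candidates (for instance the single item `d`), which is why a support reconstruction step is still needed after the off-the-shelf mining pass.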

In addition, the approaches of Evfimievski et al. (2002) and Rizvi and Haritsa (2002) use a random number generator that outputs 0 with probability p and 1 with probability 1-p to decide whether an item should be added to a transaction. Because a longer transaction leaves fewer absent items as candidates for addition, this scheme tends to add fewer items to longer transactions, which contradicts the intuition that longer transactions contain more information and should therefore receive more added items. This motivates a new way of adding items to transactions to protect the privacy of data. Moreover, for an item that occurs more frequently in the original transactions, these approaches remove the item from transactions more often but add it to transactions less often. This paper describes a technique that uses a queue and a random number generator to generate the items so that each item is added to transactions with approximately equal frequency. The experimental results indicate that this technique helps improve the accuracy of the reconstructed support of an itemset.
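One way a queue and a random number generator could be combined to spread noisy items evenly is sketched below. This excerpt does not give the paper's actual procedure, so everything here is an assumption made for illustration: the class name `NoisyItemQueue`, the refill-with-a-shuffled-permutation policy, and the parameter `k` (how many items to add, which the caller must choose).

```python
import random
from collections import deque


class NoisyItemQueue:
    """Hypothetical generator that hands out noisy items in shuffled rounds.

    Assumption: the queue is refilled with a random permutation of all items,
    so each item is handed out about equally often over many draws.
    """

    def __init__(self, items, seed=None):
        self._items = list(items)
        self._rng = random.Random(seed)
        self._queue = deque()

    def _refill(self):
        batch = self._items[:]
        self._rng.shuffle(batch)
        self._queue.extend(batch)

    def draw(self, transaction, k):
        """Return up to k distinct items that are not already in transaction."""
        available = len(set(self._items) - set(transaction))
        k = min(k, available)
        chosen, skipped = [], []
        while len(chosen) < k:
            if not self._queue:
                self._refill()
            item = self._queue.popleft()
            if item in transaction or item in chosen:
                skipped.append(item)  # cannot be used for this transaction
            else:
                chosen.append(item)
        # Put skipped items back at the front so they are used next.
        self._queue.extendleft(reversed(skipped))
        return chosen


# Usage: add two noisy items to a transaction without removing anything.
queue = NoisyItemQueue(["a", "b", "c", "d", "e", "f"], seed=0)
transaction = {"a", "b"}
noisy_transaction = transaction | set(queue.draw(transaction, k=2))
```

Returning skipped items to the front of the queue keeps the rounds roughly balanced, but the balance is only approximate once skipped items overlap with a refill.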

The primary contribution of this work is to introduce a new approach for privacy-preserving association rule mining. This approach utilizes existing association rule mining tools and is thus quite easy to implement. The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 proposes a method of using noisy items to protect the privacy of data, and Section 4 considers how to reconstruct the support of an itemset. Section 5 proposes an algorithm for mining association rules. Section 6 presents the results of the experiments, and Section 7 concludes this paper.

Related work

Privacy-preserving mining of association rules has attracted considerable attention over the past few years (Atallah et al., 1999, Dasseni et al., 2001, Vaidya and Clifton, 2002, Evfimievski et al., 2002, Kantarcioglu and Clifton, 2002, Lin and Liu, 2007, Rizvi and Haritsa, 2002, Saygin and Verykios, 2002, Saygin et al., 2001, Oliveira and Zaiane, 2002, Wang et al., 2007). Of these references, (Evfimievski et al., 2002, Lin and Liu, 2007, Rizvi and Haritsa, 2002) are the most relevant to this

Adding noisy items

This work focuses only on the task of finding frequent itemsets in association rule mining. The problem can be briefly described as follows:

Definition 1 (Agrawal et al., 1993)

Let $I$ be a set of $n$ items, where $I=\{a_1,a_2,\ldots,a_n\}$. Let $T$ be a sequence of $N$ transactions, $T=(t_1,t_2,\ldots,t_N)$, where each transaction $t_i$ is a subset of $I$. Given an itemset $A\subseteq I$, the support of $A$ is defined as

$$\mathrm{supp}_T(A)=\frac{\#\{t\in T \mid A\subseteq t\}}{N}.$$

If $\mathrm{supp}_T(A)\ge s_{\min}$, then we refer to $A$ as a frequent itemset in $T$, where $s_{\min}$ is a user-defined parameter called the minimum support.
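The definition translates directly into code. The following is a minimal sketch, assuming transactions are represented as Python sets; the function names `support` and `is_frequent` and the toy data are illustrative choices, not part of the paper.

```python
def support(transactions, itemset):
    """Support of itemset: share of transactions that contain all of its items."""
    if not transactions:
        raise ValueError("support is undefined for an empty transaction list")
    itemset = frozenset(itemset)
    containing = sum(1 for t in transactions if itemset <= t)
    return containing / len(transactions)


def is_frequent(transactions, itemset, s_min):
    """True if the support of itemset reaches the minimum support s_min."""
    return support(transactions, itemset) >= s_min


# Toy data: N = 4 transactions over the items a, b, c, d.
T = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]

supp_ac = support(T, {"a", "c"})            # contained in 3 of the 4 transactions
frequent_ac = is_frequent(T, {"a", "c"}, s_min=0.5)
```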

In what

Reconstructing the supports of itemsets

After noisy items are added to each transaction, the next problem is how to reconstruct the support of an itemset in the original transactions from the randomized transactions. Let $A_k$ be a $k$-itemset, and let $s$ and $s'$ be the supports of $A_k$ in the original transactions and in the randomized transactions, respectively. The way in which $s$ is derived from $s'$ is shown below.

If a transaction does not contain any item in an itemset, then this transaction is said to oppose the itemset. Let t

Mining algorithm

Let $T$ and $T'$, respectively, denote the set of original transactions and the set of noise-added transactions, and let $s_{\min}$ denote the minimum support. The algorithm that discovers frequent itemsets in $T$ by mining $T'$ is shown below:

  • (1) Mine the frequent itemsets in $T'$ with minimum support $s_{\min}$ using any off-the-shelf tool, and store the frequent itemsets in the set $F$;

  • (2) Scan $T'$ to collect the number of transactions at each length, and then use Eq. (3) to transform the results to $N_1,N_2,\ldots,N_m$, where $m$ is min
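The two steps listed above can be sketched as follows. This is a rough illustration only: `mine_candidates` is a brute-force stand-in for an off-the-shelf miner such as Apriori or FP-growth, the `max_size` cap and the toy data are assumptions, and the transformation by Eq. (3) is deliberately left out because the equation is not available in this excerpt.

```python
from collections import Counter
from itertools import combinations


def mine_candidates(noisy_transactions, s_min, max_size=3):
    """Step (1) stand-in: brute-force search for itemsets frequent in the noisy data."""
    n = len(noisy_transactions)
    items = sorted(set().union(*noisy_transactions))
    frequent = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in noisy_transactions if set(combo) <= t)
            if count / n >= s_min:
                frequent[frozenset(combo)] = count / n
    return frequent


def length_histogram(noisy_transactions):
    """Step (2), first half: number of noisy transactions at each length."""
    return Counter(len(t) for t in noisy_transactions)


# Hypothetical noise-added transactions.
noisy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}, {"b", "c"}]

F = mine_candidates(noisy, s_min=0.5)   # candidate frequent itemsets
hist = length_histogram(noisy)          # input to the Eq. (3) transformation
```

A real implementation would replace `mine_candidates` with an existing mining tool, which is the point of the approach; the brute-force version is only practical for tiny examples.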

The datasets

Our experiments were carried out on several synthetic datasets and two real datasets. All synthetic datasets were generated with the IBM Almaden generator and follow the naming convention of Agrawal and Srikant (1994). The two real datasets are BMS-WebView-1 and BMS-WebView-2 (denoted by BMS1 and BMS2), which were placed in the public domain by Blue Martini Software (Zheng, Kohavi, & Mason, 2001). The real datasets contain clickstream and purchase data from a legwear and legcare web retailer.

Conclusions

This paper proposes a new method for the privacy-preserving mining of association rules. The main advantages of this method are:

  • It reconstructs frequent itemsets with a high degree of accuracy. This is especially true for datasets with a large number of distinct items.

  • It integrates with any off-the-shelf association rule mining tool, which simplifies the implementation.

  • It adds more items to longer transactions to provide better protection.

  • It requires less space for randomized data. For most

References (22)

  • Wang, S.-L., et al. (2007). Hiding informative association rule sets. Expert Systems with Applications.
  • Agrawal, D., & Aggarwal, C. C. (2001). On the design and quantification of privacy preserving data mining algorithms....
  • Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th...
  • Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In W. Chen, J. Naughton, & P. A. Bernstein (Eds.),...
  • Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases,...
  • Agrawal, R., et al. (1996). Fast discovery of association rules.

  • Atallah, M., Bertino, E., Elmagarmid, A. K., Ibrahim, M., & Verykios, V.S. (1999). Disclosure limitation of sensitive...
  • Dasseni, E., Verykios, V. S., Elmagarmid, A. K., & Bertino, E. (2001). Hiding association rules by using confidence and...
  • Eisenberg, A. (2002). With false numbers, data crunchers try to mine the truth, The New York...
  • Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. In D....
  • Evfimievski, A., Gehrke, J., & Srikant, R. (2003). Limiting privacy breaches in privacy preserving data mining. In...