Frequent Itemset Mining with Differential Privacy Based on Transaction Truncation

Xia, Ying; Huang, Yu; Zhang, Xu; Bae, HaeYoung

doi:10.1007/978-3-319-89500-0_38

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 10631)

Included in the following conference series:

International Conference on Information and Communications Security

Abstract

Frequent itemset mining is the basis for discovering relationships among transactions and for providing information services such as recommendation. However, when a transaction database contains sensitive information about individuals, releasing frequent itemsets and their supports directly may put users' privacy at risk. Differential privacy provides strict protection for users: it perturbs the released statistics so that attackers cannot reliably infer sensitive data from them. The sensitivity for counting occurrences (SCO) in a transaction database grows with the transaction length, and a larger SCO reduces the availability of frequent itemsets under ε-differential privacy. It is therefore necessary to truncate long transactions. We propose the algorithm FI-DPTT, in which a quality function for the exponential mechanism (EM) is designed to select the optimal transaction length, with the aim of minimizing the error of the noisy supports. Experimental results show that the proposed algorithm improves availability and privacy efficiently.

1 Introduction

Frequent itemset mining can extract valuable knowledge from massive data, but mining sensitive data may reveal individuals' private information. For example, analysis of search logs can reveal users' page-click behavior and, from it, their private interests. It is therefore necessary to introduce a privacy protection mechanism into frequent itemset mining.

Differential privacy [1, 2] is a privacy protection technique that adds noise to query requests or analysis results. Its guarantee does not depend on the attacker's background knowledge, and it ensures that adding or removing one transaction has little effect on the query results.

Research on frequent itemset mining under differential privacy has made considerable progress. Bhaskar et al. [3] apply the Laplace mechanism (LM) to compute noisy supports of all possible frequent itemsets and then publish the top-k itemsets with the highest noisy supports. Zeng et al. [4] analyze the effect of transaction length on global sensitivity and propose transaction truncation together with a heuristic method. Zhang et al. [5] adopt the exponential mechanism (EM) to select the top-k frequent itemsets and, to improve the availability of the noisy supports, propose a consistency-constraint technique.

An effective frequent itemset mining algorithm under differential privacy should first guarantee a given level of privacy and then improve the availability of the frequent itemsets as far as possible. According to SCO, the Laplace noise is proportional to the transaction length, so shortening long transactions is the key issue for a transaction database. Truncation reduces noisy errors, but it also discards items and therefore introduces truncation errors. The challenge is to balance noisy errors against truncation errors. The main contributions of this paper are as follows.

(1) To improve the privacy protection of frequent itemsets, we propose the algorithm FI-DPTT, which perturbs the real supports of the top-k frequent itemsets with Laplace noise.
(2) To improve the availability of frequent itemsets under differential privacy, we propose a quality function for EM that balances noisy errors and truncation errors. It draws on the idea of the median to find the optimal transaction length.

2 Preliminaries

2.1 Differential Privacy

Definition 1 (Neighboring Databases).

Two transaction databases D₁ and D₂ are neighboring databases if and only if one can be obtained from the other by adding or removing one transaction, that is, $ |\text{D}_{1} - \text{D}_{2}| = 1 $.

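Definition 2

(ε-Differential Privacy [1]). Let $ \mathcal{A} $ be a privacy protection algorithm. $ \mathcal{A} $ satisfies ε-differential privacy if and only if, for any pair of neighboring databases D₁ and D₂ and any output O of $ \mathcal{A} $, we have:

$$ \Pr[\mathcal{A}(\text{D}_{1}) = \text{O}] \le e^{\varepsilon} \times \Pr[\mathcal{A}(\text{D}_{2}) = \text{O}] $$

(1)

In the above definition, $ \Pr[\mathcal{A}(\text{D}) = \text{O}] $ denotes the probability that $ \mathcal{A} $ outputs O on database D, and ε is called the privacy budget, which controls the strength of privacy protection. A smaller ε gives stricter privacy protection, and vice versa.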
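As a rough numerical illustration of how the privacy budget controls the guarantee (the three values of ε are chosen only for illustration): ε-differential privacy bounds the ratio of the output probabilities on neighboring databases by $ e^{\varepsilon} $, so

$$ e^{0.1} \approx 1.105, \qquad e^{1.0} \approx 2.718, \qquad e^{2.0} \approx 7.389 $$

With ε = 0.1, adding or removing one transaction can change the probability of any output by at most about 10.5%, whereas with ε = 2.0 the probabilities may differ by a factor of more than seven.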

2.2 Noisy Mechanism

Definition 3

(Global Sensitivity [1]). Given a query function Q with numerical outputs O, the global sensitivity of Q is ΔQ:

$$ \Delta \text{Q} = \max_{\text{D}_{1},\;\text{D}_{2}} \left| \text{Q}(\text{D}_{1}) - \text{Q}(\text{D}_{2}) \right| $$

(2)

Here D₁ and D₂ range over all pairs of neighboring databases, and ΔQ denotes the maximum distance between Q(D₁) and Q(D₂). Global sensitivity depends only on the query Q, not on any particular transaction database.

Definition 4

(Sensitivity for Counting Occurrences (SCO) [6]). Given a transaction database D whose longest transaction has length l_max, let Q = {p₁, p₂, …, p_n} be a query that computes, for each itemset p_i with length in the range I = [Q_min, Q_max], the number of occurrences of p_i in D. The global sensitivity of Q is ΔQ = ΔI × l_max, where ΔI = Q_max – Q_min + 1.

By Definition 4, SCO is proportional to the maximum transaction length. Even a single long transaction therefore forces a large amount of Laplace noise to be added to the supports of all frequent itemsets.
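The effect of one long transaction on SCO can be checked with a few lines of Python. This is only a sketch of the formula in Definition 4; the function name `sco` and the example lengths (50 and 10) are illustrative assumptions, not values from the paper.

```python
def sco(l_max: int, q_min: int, q_max: int) -> int:
    """Sensitivity for counting occurrences: delta_Q = delta_I * l_max,
    with delta_I = q_max - q_min + 1 (Definition 4)."""
    if l_max < 0 or q_min < 1 or q_max < q_min:
        raise ValueError("need l_max >= 0 and 1 <= q_min <= q_max")
    delta_i = q_max - q_min + 1
    return delta_i * l_max


# Hypothetical database whose longest transaction has 50 items,
# queried for itemsets of length 1 to 3.
print(sco(l_max=50, q_min=1, q_max=3))  # 150

# The same query after truncating every transaction to at most 10 items.
print(sco(l_max=10, q_min=1, q_max=3))  # 30
```

In this example truncation cuts the sensitivity, and hence the Laplace scale for a fixed ε, by a factor of five.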

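Definition 5

(Laplace Mechanism (LM) [7]). Given a query Q(D) → O, if the output of algorithm $ \mathcal{A} $ satisfies Eq. (3), then $ \mathcal{A} $ enforces ε-differential privacy.

$$ \mathcal{A}(\text{D}) = \text{Q}(\text{D}) + \left\langle \text{Lap}_{1}(\Delta \text{Q}/\varepsilon),\; \ldots,\; \text{Lap}_{n}(\Delta \text{Q}/\varepsilon) \right\rangle $$

(3)

The Lap_i(ΔQ/ε) (1 ≤ i ≤ n) are mutually independent Laplace noise variables with scale parameter ΔQ/ε, so the noise is proportional to ΔQ and inversely proportional to ε. The idea is to add Laplace noise to the real output values for privacy protection.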
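A minimal Python sketch of the Laplace mechanism follows, using only the standard library. It assumes the query answers are given as a list of numbers; the function names and the example supports are illustrative and not part of the paper. A Laplace(0, b) sample is drawn as b times the difference of two independent unit-rate exponential samples.

```python
import random


def laplace_noise(scale, rng):
    """One sample from Laplace(0, scale): the difference of two independent
    Exp(1) draws is Laplace(0, 1), which is then multiplied by `scale`."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(true_values, sensitivity, epsilon, rng=None):
    """Add independent Laplace(sensitivity / epsilon) noise to each value."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in true_values]


# Hypothetical real supports of three itemsets, sensitivity 10, budget 0.5.
noisy = laplace_mechanism([120, 85, 40], sensitivity=10, epsilon=0.5,
                          rng=random.Random(7))
print(noisy)
```

The expected absolute noise equals the scale ΔQ/ε, so halving ε or doubling ΔQ doubles the typical perturbation.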

Definition 6

(Exponential Mechanism (EM) [8]). Given a quality function u(p, D), if algorithm $ \mathcal{A} $ satisfies Eq. (4), then $ \mathcal{A} $ enforces ε-differential privacy.

$$ \mathcal{A}(\text{D},\;u) = \left\{ p \;:\; \Pr[p \in \text{O}] \propto \exp \left( \frac{\varepsilon \times u(p,\;D)}{2 \times \Delta u} \right) \right\} $$

(4)

Here Δu denotes the global sensitivity of the quality function u(p, D), and p denotes an item selected from the output domain O. The key point is the design of u(p, D): a larger $ \exp \left( {\frac{\varepsilon \times u(p,\;D)}{2 \times \Delta u}} \right) $ gives p a higher probability of being selected as the output.
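The selection rule of the exponential mechanism can be sketched in Python as below. This is a generic illustration rather than the paper's implementation: the function name, the candidate list and the quality callback are assumptions. The maximum exponent is subtracted before exponentiation, which leaves the selection probabilities unchanged but avoids floating-point overflow.

```python
import math
import random


def exponential_mechanism(candidates, quality, epsilon, delta_u, rng=None):
    """Pick one candidate with probability proportional to
    exp(epsilon * quality(c) / (2 * delta_u))."""
    if rng is None:
        rng = random.Random()
    exponents = [epsilon * quality(c) / (2.0 * delta_u) for c in candidates]
    top = max(exponents)
    # Shifting by the maximum rescales all weights by the same factor.
    weights = [math.exp(e - top) for e in exponents]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical candidates scored by a toy quality function.
lengths = [5, 10, 15, 20]
picked = exponential_mechanism(lengths, quality=lambda l: -abs(l - 10),
                               epsilon=1.0, delta_u=1.0,
                               rng=random.Random(3))
print(picked)
```

Candidates with higher quality are exponentially more likely to be chosen, yet every candidate keeps a non-zero probability, which is what makes the selection private.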

2.3 Availability Analysis

Definition 7

(False Negative Rate (FNR) [5]). Let TP_k(D) be the top-k frequent itemsets in the database D, and let TP_k(D_t) be the top-k frequent itemsets obtained from the truncated database D_t. FNR measures the fraction of the real top-k frequent itemsets that are in TP_k(D) but not in TP_k(D_t). A smaller FNR means higher data accuracy.

$$ \text{FNR} = \frac{\left| \left( \text{TP}_{k}(\text{D}) \cup \text{TP}_{k}(\text{D}_{t}) \right) - \text{TP}_{k}(\text{D}_{t}) \right|}{k} $$

(5)
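Eq. (5) translates directly into set operations. The following Python sketch assumes each itemset is represented as a `frozenset` of items; the function name and the toy data are illustrative.

```python
def false_negative_rate(true_topk, private_topk, k):
    """FNR = |(TP_k(D) union TP_k(D_t)) - TP_k(D_t)| / k."""
    true_set = set(true_topk)
    private_set = set(private_topk)
    missed = (true_set | private_set) - private_set
    return len(missed) / k


# Toy example with k = 4: two of the real top-4 itemsets are missing.
real = [frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("c")]
released = [frozenset("a"), frozenset("b"), frozenset("d"), frozenset("e")]
print(false_negative_rate(real, released, k=4))  # 0.5
```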

Definition 8

(Average Relative Error (ARE) [5]). ARE measures the error introduced by adding Laplace noise to the top-k frequent itemsets in database D. TC(p_i, TP_k(D)) denotes the real support of the frequent itemset p_i in database D, and NC(p_i, TP_k(D_t)) denotes the noisy support of p_i. If p_i is not in TP_k(D_t), we set NC(p_i, TP_k(D_t)) = 0. A smaller ARE means higher data accuracy.

$$ \text{ARE} = \frac{\sum\nolimits_{p_{i} \in \text{TP}_{k}(\text{D})} \frac{\left| \text{TC}(p_{i},\;\text{TP}_{k}(\text{D})) - \text{NC}(p_{i},\;\text{TP}_{k}(\text{D}_{t})) \right|}{\text{TC}(p_{i},\;\text{TP}_{k}(\text{D}))}}{k} $$

(6)
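A small Python sketch of Eq. (6) is given below. It assumes the real and noisy supports are held in dictionaries keyed by itemset, with itemsets missing from the released result treated as having noisy support 0, as the definition prescribes; the names and numbers are illustrative.

```python
def average_relative_error(true_supports, noisy_supports, k):
    """ARE over the real top-k itemsets; an itemset that is absent from
    the released result contributes a relative error of 1."""
    total = 0.0
    for itemset, tc in true_supports.items():
        nc = noisy_supports.get(itemset, 0.0)
        total += abs(tc - nc) / tc
    return total / k


# Toy example with k = 2: one itemset is off by 10%, the other is missing.
real_supports = {"A": 100.0, "B": 50.0}
released_supports = {"A": 90.0}
print(average_relative_error(real_supports, released_supports, k=2))
```

Here the two relative errors are 0.1 and 1.0, so the average is about 0.55.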

3 Proposed Algorithm

3.1 Idea of Transaction Truncation

We first define the optimal transaction length. The total error is the sum of the noisy error and the truncation error. Suppose the original transaction database D is truncated into a transaction database D_t. If the total error of the frequent itemsets generated from D_t under ε-differential privacy is smaller than that of any other truncated database, then the longest transaction length in D_t is the optimal transaction length for D.

3.2 Algorithm Description

To reduce truncation errors when truncating a transaction database, the Apriori method is run first to obtain the candidate frequent 1-itemsets and their supports, and the items of each transaction are then sorted in descending order of support to obtain the database $ \text{D}' $ (Step 1). The privacy budget ε (Step 2) is split evenly between ε₁ (Step 3) and ε₂ (Step 5). The database $ \text{D}' $ is truncated into D_t according to l_opt (Step 4).
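The sorting and truncation steps (Steps 1 and 4) might look as follows in Python. This is a sketch under stated assumptions: transactions are lists of comparable items, single-item supports come from a plain counting pass rather than a full Apriori implementation, ties in support are broken by item order, and all function names are illustrative.

```python
from collections import Counter


def sort_items_by_support(db):
    """Step 1 (sketch): count single-item supports, then reorder the items
    of every transaction by descending support."""
    support = Counter(item for t in db for item in set(t))
    db_sorted = [sorted(set(t), key=lambda i: (-support[i], i)) for t in db]
    return db_sorted, support


def truncate(db_sorted, l_opt):
    """Step 4 (sketch): keep only the first l_opt items of each transaction,
    i.e. the items with the highest supports."""
    if l_opt < 1:
        raise ValueError("l_opt must be at least 1")
    return [t[:l_opt] for t in db_sorted]


db = [["c", "a", "b"], ["b", "a"], ["a"]]
db_sorted, support = sort_items_by_support(db)
print(db_sorted)               # [['a', 'b', 'c'], ['a', 'b'], ['a']]
print(truncate(db_sorted, 2))  # [['a', 'b'], ['a', 'b'], ['a']]
```

Because the least frequent items sit at the end of each sorted transaction, truncation drops the items that are least likely to belong to a frequent itemset.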

3.3 Interpretation of Important Processes

For the algorithm FI-DPTT, two important procedures are interpreted as follows.

Procedure SelectOptLen draws on a property of the median [9]: it describes the central tendency of the transaction records and is rarely influenced by extreme values. We scan the database $ \text{D}' $ to obtain the length of each transaction and then adopt EM to select l_opt. The quality function is $ \text{u}(\text{t},\;\text{D}')\; = \;\frac{\text{count}_{\text{t}}}{\left| \text{rank}(\text{t}) - \text{SCALE} \times |\text{D}'| \right|} $. If rank(t) = SCALE × |$ \text{D}' $|, we set u(t, $ \text{D}' $) = 2 × count_t. Here count_t denotes the support of the last item in the current transaction record t, and rank(t) denotes the position of t when the records of $ \text{D}' $ are ranked in ascending order. Adding or removing one transaction record from $ \text{D}' $ changes u(t, $ \text{D}' $) by at most one, so the global sensitivity of u(t, $ \text{D}' $) is one, that is, Δu(t, D′) = 1.
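The quality function can be sketched in Python as follows. Several details are assumptions made for illustration, since the text does not spell them out: records are ranked by transaction length, ranks are 1-based, ties keep their original order, every transaction is non-empty with items already sorted by descending support, and the function name is hypothetical.

```python
def quality_scores(db_sorted, support, scale):
    """u(t, D') = count_t / |rank(t) - SCALE * |D'||, or 2 * count_t when
    rank(t) equals SCALE * |D'|. Returns {transaction index: score}."""
    n = len(db_sorted)
    target = scale * n
    # Assumption: rank transactions by length, ascending (stable sort).
    order = sorted(range(n), key=lambda i: len(db_sorted[i]))
    scores = {}
    for rank, idx in enumerate(order, start=1):
        count_t = support[db_sorted[idx][-1]]  # support of the last item
        dist = abs(rank - target)
        scores[idx] = 2 * count_t if dist < 1e-9 else count_t / dist
    return scores


db_sorted = [["a"], ["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"]]
support = {"a": 4, "b": 3, "c": 2, "d": 1}
print(quality_scores(db_sorted, support, scale=0.5))
# {0: 4.0, 1: 6, 2: 2.0, 3: 0.5}
```

Records whose rank lies near SCALE × |D′| and whose last item is still well supported score highest; these scores would then be passed to EM with Δu = 1 to select l_opt.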

Procedure Perturb-Frequency generates the frequent itemsets and adds Laplace noise to their real supports. Let c(p_i) be the real support of a frequent itemset p_i and c_t(p_i) the support after Laplace noise is added. The global sensitivity of the frequent itemsets in the database D_t is l_opt.
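Combined with Definition 5, the perturbation step can be written out as below. This is a sketch under the assumption that Step 5 applies the Laplace mechanism with sensitivity l_opt and the budget share ε₂ = ε/2 described in Sect. 3.2; the value l_opt = 10 is a hypothetical example.

$$ c_{t}(p_{i}) = c(p_{i}) + \text{Lap}\left( \frac{l_{opt}}{\varepsilon_{2}} \right), \qquad \varepsilon_{2} = \frac{\varepsilon}{2} $$

For instance, with ε = 1.0 and l_opt = 10 the Laplace scale is 10 / 0.5 = 20, whereas a longest transaction of 50 items would require a scale of 50 / 0.5 = 100 under the same budget share.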

4 Experimental Evaluation

4.1 Experimental Setting

This section evaluates the data availability of the FI-DPTT algorithm against DP-topkP [5]. The experimental environment is an Intel Core i5-2410M CPU at 2.30 GHz with 4 GB of memory running Windows 7, and the datasets are PUMSB-STAR, RETAIL and KOSARAK [10] (Table 1). FNR and ARE are used as the evaluation metrics. Each experiment is repeated five times and the average is reported.

Table 1. Description of three datasets.

4.2 Experimental Result Analysis

We fix k = 100 and ε = 1.0 to analyze the impact of SCALE on availability. As Fig. 1 shows, SCALE = 0.85 gives the best availability. A smaller SCALE increases truncation errors and reduces noisy errors, and a larger SCALE does the opposite; in both cases the total error tends to increase. Furthermore, truncation errors affect availability more than noisy errors do. We fix SCALE = 0.85 in the follow-up experiments.

We fix k = 100 to analyze the impact of ε on availability in Fig. 2. When ε < 1, FI-DPTT achieves a lower FNR than DP-topkP, because we give priority to reducing truncation errors and FNR is related only to truncation errors. FI-DPTT also achieves a lower ARE than DP-topkP. ARE is related to both truncation errors and noisy errors, so it is larger than FNR. These results show that the availability of FI-DPTT is better than that of DP-topkP.

In Fig. 3, we fix ε = 1 to analyze the impact of k on availability. As k increases, the availability of both algorithms decreases, because a larger k leads to a smaller threshold λ, which increases both truncation errors and noisy errors.

5 Conclusion

Long transactions in a transaction database reduce the availability of frequent itemsets under differential privacy. The algorithm FI-DPTT combines the exponential mechanism with the Laplace mechanism. To improve the availability of frequent itemsets under differential privacy, a quality function for the exponential mechanism is designed to balance truncation errors and noisy errors, and Laplace noise is then added to the real supports of the frequent itemsets. The proposed algorithm achieves better performance in both data availability and privacy.

References

1. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006). https://doi.org/10.1007/11787006_1
2. Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79228-4_1
3. Bhaskar, R., Laxman, S., Thakurta, A.: Discovering frequent patterns in sensitive data. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 503–512 (2010)
4. Zeng, C., Naughton, J.F., Cai, J.Y.: On differentially private frequent itemset mining. VLDB J. 6(1), 25–36 (2012)
5. Zhang, X., Miao, W., Meng, X.: An accurate method for mining top-k frequent pattern under differential privacy. J. Comput. Res. Develop. 51(1), 104–114 (2014)
6. Bonomi, L., Xiong, L.: A two-phase algorithm for mining sequential patterns with differential privacy. In: ACM International Conference on Information & Knowledge Management, pp. 269–278. ACM (2013)
7. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
8. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: Foundations of Computer Science 2007, FOCS 2007, pp. 94–103. IEEE (2007)
9. Guoqing, L., Xiaojian, Z., Liping, D.: Frequent sequential pattern mining under differential privacy. J. Comput. Res. Develop. 52(12), 2789–2801 (2015)
10. Datasets. http://fimi.ua.ac.be/data/

Acknowledgments

This work is funded by Chongqing Natural Science Foundation (cstc2014kjrc-qnrc40002), Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1500431, KJ1400429).

Author information

Authors and Affiliations

Research Center of Spatial Information System, Chongqing University of Posts and Telecommunications, Chongqing, China
Ying Xia, Yu Huang, Xu Zhang & HaeYoung Bae

Corresponding author

Correspondence to Yu Huang.

Editor information

Editors and Affiliations

Chinese Academy of Sciences and Peking University, Beijing, China
Sihan Qing
Royal Holloway, University of London, Egham, Surrey, United Kingdom
Chris Mitchell
University of Surrey, Guildford, Surrey, United Kingdom
Liqun Chen
Microsoft, Beijing, China
Dongmei Liu

Cite this paper

Xia, Y., Huang, Y., Zhang, X., Bae, H. (2018). Frequent Itemset Mining with Differential Privacy Based on Transaction Truncation. In: Qing, S., Mitchell, C., Chen, L., Liu, D. (eds) Information and Communications Security. ICICS 2017. Lecture Notes in Computer Science, vol 10631. Springer, Cham. https://doi.org/10.1007/978-3-319-89500-0_38

DOI: https://doi.org/10.1007/978-3-319-89500-0_38
Published: 10 April 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-89499-7
Online ISBN: 978-3-319-89500-0
eBook Packages: Computer Science (R0)
