Abstract
Frequent itemset mining is the basis for discovering relationships among transactions and for providing information services such as recommendation. However, when a transaction database contains sensitive information about individuals, directly releasing frequent itemsets and their supports may expose users to privacy risks. Differential privacy offers a rigorous guarantee: it perturbs the released statistics so that an attacker cannot reliably infer sensitive data from them. The sensitivity for counting occurrences (SCO) in a transaction database grows with the transaction length, and a larger SCO reduces the availability of frequent itemsets under ε-differential privacy, so long transactions need to be truncated. We propose the algorithm FI-DPTT, in which a quality function for the exponential mechanism (EM) selects the optimal transaction length so as to minimize the error of the noisy supports. Experimental results show that the proposed algorithm efficiently improves availability and privacy.
Keywords
- Frequent itemset mining
- Differential privacy
- Exponential mechanism
- Quality function
- Laplace mechanism
- Transaction truncation
1 Introduction
Frequent itemset mining can extract valuable knowledge from massive data, but mining sensitive data may reveal individual privacy. For example, analyzing search logs reveals users' page-click behavior, from which their private interests can be inferred. Therefore, a privacy protection mechanism needs to be introduced into frequent itemset mining.
Differential privacy [1, 2] is a privacy protection technique that adds noise to query requests or analysis results. Its guarantee does not depend on the attacker's background knowledge, and it ensures that adding or removing one transaction has little effect on the query results.
Research on frequent itemset mining under differential privacy has made great progress. Bhaskar et al. [3] applied the Laplace mechanism (LM) to compute noisy supports of all possible frequent itemsets and then published the top-k frequent itemsets with the highest noisy supports. Zeng et al. [4] analyzed the effect of transaction length on global sensitivity and proposed transaction truncation together with a heuristic method. Zhang et al. [5] adopted EM to select the top-k frequent itemsets and, to boost the availability of the noisy supports, proposed the technique of consistency constraints.
An effective frequent itemset mining algorithm with differential privacy should first guarantee a given level of privacy and then try to improve the availability of the frequent itemsets. According to SCO, the Laplace noise is proportional to the maximum transaction length, so shortening long transactions is the key point for a transaction database. Truncation reduces noisy errors, but it also loses items and thus introduces truncation errors. The challenge is therefore how to balance noisy errors against truncation errors. The main contributions of this paper are as follows.
(1) To improve the privacy protection of frequent itemsets, we propose the algorithm FI-DPTT, which perturbs the real supports of the top-k frequent itemsets with Laplace noise.
(2) To improve the availability of frequent itemsets under differential privacy, we propose a quality function for EM that balances noisy errors and truncation errors. It draws on the idea of the median to find the optimal transaction length.
2 Preliminaries
2.1 Differential Privacy
Definition 1 (Neighboring Databases).
Two transaction databases D1 and D2 are neighboring databases if and only if one can be obtained from the other by adding or removing one transaction, that is, \( |\text{D}_{1} - \text{D}_{2} |\; = \;1 \).
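Definition 2 (ε-Differential Privacy [1]).
Let \( \mathcal{A} \) be a privacy-preserving algorithm. \( \mathcal{A} \) satisfies ε-differential privacy if and only if, for any pair of neighboring databases D1 and D2 and any output O of \( \mathcal{A} \), we have:
\( \Pr [\mathcal{A}(\text{D}_{1}) = O]\; \le \;e^{\varepsilon } \times \Pr [\mathcal{A}(\text{D}_{2}) = O] \)  (1)
In the above definition, \( \Pr [\mathcal{A}(\text{D}_{i}) = O] \) denotes the probability that \( \mathcal{A} \) outputs O on the database Di, and ε is called the privacy budget, which controls the strength of privacy protection. A smaller ε leads to stricter privacy protection and vice versa.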
2.2 Noisy Mechanism
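Definition 3 (Global Sensitivity [1]).
Given a query function Q with numerical outputs O, the global sensitivity of Q is ΔQ:
\( \Delta Q\; = \;\max_{\text{D}_{1} ,\;\text{D}_{2} } \left\| {Q(\text{D}_{1} ) - Q(\text{D}_{2} )} \right\|_{1} \)  (2)
where D1 and D2 are arbitrary neighboring databases. ΔQ is the maximum distance between Q(D1) and Q(D2). Global sensitivity depends only on the query Q and not on any particular transaction database.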
Definition 4 (Sensitivity for Counting Occurrences (SCO) [6]).
Given a transaction database D whose longest transaction has length lmax, let Q = {p1, p2, …, pn} be a query that computes the number of occurrences in D of each itemset pi whose length lies in the range I = [Qmin, Qmax]. Then the global sensitivity is ΔQ = ΔI × lmax, where ΔI = Qmax – Qmin + 1.
By Definition 4, SCO is proportional to the maximum transaction length. Even if the database contains only one long transaction, a large amount of Laplace noise must be added to the supports of the frequent itemsets.
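The effect of a single long transaction on SCO can be illustrated with a small sketch. The function name `sco_sensitivity` and the toy database are hypothetical and only serve to make the arithmetic of Definition 4 concrete.

```python
def sco_sensitivity(l_max: int, q_min: int, q_max: int) -> int:
    """Global sensitivity from Definition 4: delta_I * l_max."""
    if l_max < 0 or q_min < 1 or q_max < q_min:
        raise ValueError("expected l_max >= 0 and 1 <= q_min <= q_max")
    delta_i = q_max - q_min + 1
    return delta_i * l_max


# Hypothetical toy database: four short transactions and one long outlier.
transactions = [
    {"a", "b"},
    {"a", "c"},
    {"b", "c", "d"},
    {"a"},
    set("abcdefghijklmnopqrst"),  # a single transaction with 20 items
]

l_max = max(len(t) for t in transactions)
sensitivity_full = sco_sensitivity(l_max, q_min=1, q_max=1)
sensitivity_truncated = sco_sensitivity(3, q_min=1, q_max=1)

epsilon = 1.0
print(sensitivity_full / epsilon)       # Laplace scale with the outlier kept
print(sensitivity_truncated / epsilon)  # Laplace scale if lengths are capped at 3
```

In this toy case the single outlier alone decides the noise scale, which is the motivation for truncation.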
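Definition 5 (Laplace Mechanism (LM) [7]).
Given a query Q(D) → O, if the output of algorithm \( \mathcal{A} \) satisfies Eq. (3), then \( \mathcal{A} \) enforces ε-differential privacy.
\( \mathcal{A}(\text{D})\; = \;Q(\text{D}) + \left\langle {\text{Lap}_{1} (\Delta Q/\varepsilon ),\; \ldots ,\;\text{Lap}_{n} (\Delta Q/\varepsilon )} \right\rangle \)  (3)
The Lapi(ΔQ/ε) (1 ≤ i ≤ n) are mutually independent Laplace noise variables with parameter ΔQ/ε, so the Laplace noise is proportional to ΔQ and inversely proportional to ε. The idea is to add Laplace noise to the real output values for privacy protection.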
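A minimal sketch of the Laplace mechanism using only the Python standard library is given below. The function names and the sample supports are assumptions made for illustration. The sampler relies on the fact that the difference of two independent exponential variables with the same mean follows a Laplace distribution with that scale.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with mean
    `scale` is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_values, sensitivity, epsilon, rng=None):
    """Add independent Lap(sensitivity / epsilon) noise to each value."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in true_values]


# Hypothetical real supports; the seed is fixed only to make the demo repeatable.
rng = random.Random(2017)
real_supports = [120.0, 95.0, 40.0]
noisy_supports = laplace_mechanism(real_supports, sensitivity=10, epsilon=1.0, rng=rng)
print(noisy_supports)
```

Halving ε or doubling the sensitivity doubles the noise scale, which matches the proportionality stated in Definition 5.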
Definition 6 (Exponential Mechanism (EM) [8]).
Given a quality function u(p, D), if algorithm \( \mathcal{A} \) satisfies Eq. (4), then \( \mathcal{A} \) enforces ε-differential privacy.
\( \mathcal{A}(\text{D},\;u)\; = \;\left\{ {p:\;\Pr [p \in O] \propto \exp \left( {\frac{\varepsilon \times u(p,\;D)}{2 \times \Delta u}} \right)} \right\} \)  (4)
where Δu denotes the global sensitivity of the quality function u(p, D). The key point is how to design the quality function u(p, D); p denotes an item selected from the output domain O. A larger \( \exp \left( {\frac{\varepsilon \times u(p,\;D)}{2 \times \Delta u}} \right) \) gives p a higher probability of being selected as the output.
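The selection rule of the exponential mechanism can be sketched as follows. The helper names, the candidates and their quality scores are hypothetical. Subtracting the maximum score before exponentiating is a numerical-stability choice of this sketch and leaves the selection probabilities unchanged.

```python
import math
import random


def selection_probabilities(scores, epsilon, delta_u):
    """Probabilities proportional to exp(epsilon * u / (2 * delta_u))."""
    best = max(scores)  # shifting by the maximum avoids overflow in exp()
    weights = [math.exp(epsilon * (s - best) / (2.0 * delta_u)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(candidates, quality, epsilon, delta_u, rng=None):
    """Sample one candidate according to the exponential mechanism."""
    if rng is None:
        rng = random.Random()
    probs = selection_probabilities([quality(c) for c in candidates], epsilon, delta_u)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical candidates with quality scores 0, 2 and 4.
candidates = ["p1", "p2", "p3"]
quality = {"p1": 0.0, "p2": 2.0, "p3": 4.0}.get

probs = selection_probabilities([quality(c) for c in candidates], epsilon=1.0, delta_u=1.0)
chosen = exponential_mechanism(candidates, quality, epsilon=1.0, delta_u=1.0)
print(probs, chosen)
```

With ε = 1 and Δu = 1, raising a candidate's quality by 2 multiplies its selection probability by e, so higher-quality candidates are favored without ever being chosen deterministically.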
2.3 Availability Analysis
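Definition 7 (False Negative Rate (FNR) [5]).
Let TPk(D) be the top-k frequent itemsets in the database D, and let TPk(Dt) be the top-k frequent itemsets published from the truncated database Dt defined in Sect. 3.1. FNR measures the fraction of the real top-k frequent itemsets that are in TPk(D) but not in TPk(Dt):
\( \text{FNR}\; = \;\frac{{|TP_{k} (\text{D}) \setminus TP_{k} (\text{D}_{t} )|}}{k} \)  (5)
A smaller FNR leads to higher data accuracy.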
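Definition 8 (Average Relative Error (ARE) [5]).
ARE measures the error introduced by adding Laplace noise to the supports of the top-k frequent itemsets in database D:
\( \text{ARE}\; = \;\frac{1}{k}\sum\limits_{{p_{i} \in TP_{k} (\text{D})}} {\frac{{|NC(p_{i} ,\;TP_{k} (\text{D}_{t} )) - TC(p_{i} ,\;TP_{k} (\text{D}))|}}{{TC(p_{i} ,\;TP_{k} (\text{D}))}}} \)  (6)
where TC(pi, TPk(D)) denotes the real support of the frequent itemset pi in database D and NC(pi, TPk(Dt)) denotes the noisy support of pi. If pi is not in TPk(Dt), we set NC(pi, TPk(Dt)) = 0. A smaller ARE leads to higher data accuracy.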
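The two availability metrics can be sketched in a few lines. This is an illustration under the reading that FNR is the fraction of real top-k itemsets missing from the published result and that ARE averages the relative error over the real top-k itemsets, with a noisy support of 0 for unpublished itemsets. The function names and sample values are hypothetical.

```python
def false_negative_rate(true_topk, published_topk):
    """Fraction of the real top-k itemsets absent from the published top-k."""
    true_set = set(true_topk)
    missed = true_set - set(published_topk)
    return len(missed) / len(true_set)


def average_relative_error(true_support, noisy_support):
    """Mean of |NC - TC| / TC over the real top-k itemsets.

    true_support:  dict itemset -> real support (the real top-k itemsets)
    noisy_support: dict itemset -> noisy support (the published itemsets)
    Itemsets missing from noisy_support count as a noisy support of 0.
    """
    total = 0.0
    for itemset, tc in true_support.items():
        nc = noisy_support.get(itemset, 0.0)
        total += abs(nc - tc) / tc
    return total / len(true_support)


# Hypothetical values for k = 3; itemsets are frozensets of items.
a, b, ab = frozenset("a"), frozenset("b"), frozenset("ab")
true_support = {a: 100.0, b: 80.0, ab: 50.0}
noisy_support = {a: 110.0, b: 76.0}  # {a, b} was lost in the published result

fnr = false_negative_rate(true_support, noisy_support)
are = average_relative_error(true_support, noisy_support)
print(fnr, are)
```

A missing itemset contributes a relative error of 1 to ARE, so truncation losses affect both metrics while noise affects only ARE.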
3 Proposed Algorithm
3.1 Idea of Transaction Truncation
We first define the optimal transaction length. The total error is the sum of the noisy error and the truncation error. Suppose the original transaction database D is truncated into a database Dt. If the total error of the frequent itemsets generated from Dt under ε-differential privacy is smaller than that of any other truncated database, then the longest transaction length in Dt is the optimal transaction length of D.
3.2 Algorithm Description

To reduce truncation errors when truncating the transaction database, the Apriori method is performed first to get the candidate frequent 1-itemsets and their supports, and the items of each transaction are then ranked in descending order of support to obtain the database \( \text{D}' \) (Step 1). The privacy budget ε (Step 2) is split evenly into ε1 (used in Step 3) and ε2 (used in Step 5). The database \( \text{D}' \) is truncated into Dt according to lopt (Step 4).
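The reordering and truncation steps can be sketched as follows. This is not the authors' listing: a plain count stands in for the Apriori pass over 1-itemsets, ties between equally frequent items are broken by item name (an assumption), and all function names are hypothetical.

```python
from collections import Counter


def item_supports(transactions):
    """Support of each single item: the number of transactions containing it."""
    return Counter(item for t in transactions for item in set(t))


def reorder_by_support(transactions, supports):
    """Step 1: sort the items of each transaction by descending support."""
    return [
        sorted(set(t), key=lambda item: (-supports[item], item))
        for t in transactions
    ]


def truncate(ordered_transactions, l_opt):
    """Step 4: keep only the first l_opt (highest-support) items."""
    return [t[:l_opt] for t in ordered_transactions]


# Hypothetical toy database.
db = [["c", "a", "b"], ["b", "a"], ["d", "a", "c", "b"], ["a"]]
supports = item_supports(db)
d_prime = reorder_by_support(db, supports)
d_t = truncate(d_prime, l_opt=2)

epsilon = 1.0
epsilon_1 = epsilon_2 = epsilon / 2  # even split of the privacy budget
print(d_prime)
print(d_t)
```

Because the items are sorted by support before cutting, truncation drops the least frequent items of a long transaction first, which is what keeps the truncation error small.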
3.3 Interpretation of Important Processes
For the algorithm FI-DPTT, two important procedures are interpreted as follows.

Procedure SelectOptLen draws on a characteristic of the median [9]: it describes the central tendency of the transaction records and is rarely influenced by extreme values. We scan the database \( \text{D}' \) to obtain the length of each transaction, then adopt EM to get lopt. The quality function is \( \text{u}(\text{t},\;\text{D}')\; = \;\frac{{\text{count}_{\text{t}} }}{{|\text{rank}(\text{t}) - \text{SCALE} \times |\text{D}'||}} \). If rank(t) = SCALE × |\( \text{D}' \)|, we set u(t, \( \text{D}' \)) = 2 × countt. Here countt denotes the support of the last item in the current transaction record t, and rank(t) denotes the position of t when the transactions of \( \text{D}' \) are ranked in ascending order. Adding or removing one transaction record from \( \text{D}' \) changes u(t, \( \text{D}' \)) by at most one, so the global sensitivity of the quality function is Δu(t, D′) = 1.
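One possible reading of SelectOptLen is sketched below. It assumes that the candidates are the transactions of the reordered database, that rank(t) is the 1-based position of t when transactions are sorted by ascending length, that every transaction is non-empty, and that the returned lopt is the length of the sampled transaction. The function names are hypothetical, and Δu = 1 is taken from the text.

```python
import math
import random


def quality_scores(ordered_db, supports, scale):
    """Score each transaction with u(t, D') as described in the text."""
    by_length = sorted(ordered_db, key=len)  # assumption: rank by ascending length
    target = scale * len(by_length)          # SCALE x |D'|
    scores = []
    for rank, t in enumerate(by_length, start=1):
        count_t = supports[t[-1]]            # support of the last item of t
        if math.isclose(rank, target):
            scores.append(2.0 * count_t)     # special case rank(t) = SCALE x |D'|
        else:
            scores.append(count_t / abs(rank - target))
    return by_length, scores


def select_opt_len(ordered_db, supports, epsilon_1, scale=0.85, rng=None):
    """Sample a transaction with the exponential mechanism and return its length."""
    if rng is None:
        rng = random.Random()
    by_length, scores = quality_scores(ordered_db, supports, scale)
    best = max(scores)
    # Delta u = 1 as stated in the text; shifting by `best` avoids overflow.
    weights = [math.exp(epsilon_1 * (s - best) / 2.0) for s in scores]
    chosen = rng.choices(by_length, weights=weights, k=1)[0]
    return len(chosen)


# Hypothetical reordered database and item supports.
d_prime = [["a", "b", "c"], ["a", "b"], ["a", "b", "c", "d"], ["a"]]
supports = {"a": 4, "b": 3, "c": 2, "d": 1}

_, scores = quality_scores(d_prime, supports, scale=0.5)
l_opt = select_opt_len(d_prime, supports, epsilon_1=0.5, scale=0.5)
print(scores, l_opt)
```

The score peaks for transactions whose rank is near SCALE × |D′|, so the sampled length tends toward the chosen quantile of the length distribution rather than toward the longest transaction.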

Procedure Perturb-Frequency generates the frequent itemsets and adds Laplace noise to their real supports. Let c(pi) be the real support of a frequent itemset pi and ct(pi) be the support after Laplace noise is added. lopt is the global sensitivity of the frequent itemsets in the database Dt.
4 Experimental Evaluation
4.1 Experimental Setting
This section compares the data availability of the FI-DPTT algorithm with that of DP-topkP [5]. The experimental environment is an Intel Core i5-2410M CPU at 2.30 GHz with 4 GB of memory, running Windows 7. The datasets are PUMSB-STAR, RETAIL and KOSARAK [10] (Table 1). FNR and ARE are used for data analysis. Each experiment is repeated five times and the average is reported.
4.2 Experimental Result Analysis
We fix k = 100 and ε = 1.0 to analyze the impact of SCALE on availability. As Fig. 1 shows, SCALE = 0.85 gives the best availability. A smaller SCALE increases truncation errors and reduces noisy errors, so that the total error tends to increase, and vice versa. Furthermore, truncation errors affect availability more than noisy errors do. We fix SCALE = 0.85 in the follow-up experiments.
We fix k = 100 to analyze the impact of ε on availability in Fig. 2. When ε < 1, FI-DPTT achieves a lower FNR than DP-topkP, because we give priority to reducing truncation errors and FNR is related only to truncation errors. FI-DPTT also achieves a lower ARE than DP-topkP. Since ARE is related to both truncation errors and noisy errors, it is larger than FNR. These results show that the availability of FI-DPTT is better than that of DP-topkP.
In Fig. 3, we fix ε = 1 to analyze the impact of k on availability. As k increases, the availability of both algorithms decreases, because a larger k leads to a smaller threshold λ, which increases both truncation errors and noisy errors.
5 Conclusion
Long transactions in a transaction database reduce the availability of frequent itemsets under differential privacy. The algorithm FI-DPTT combines the exponential mechanism with the Laplace mechanism. To improve the availability of frequent itemsets under differential privacy, a quality function for the exponential mechanism is designed to balance truncation errors and noisy errors, and Laplace noise is then added to the real supports of the frequent itemsets. The proposed algorithm achieves better performance in both data availability and privacy.
References
[1] Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006). https://doi.org/10.1007/11787006_1
[2] Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79228-4_1
[3] Bhaskar, R., Laxman, S., Thakurta, A.: Discovering frequent patterns in sensitive data. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 503–512 (2010)
[4] Zeng, C., Naughton, J.F., Cai, J.Y.: On differentially private frequent itemset mining. Proc. VLDB Endow. 6(1), 25–36 (2012)
[5] Zhang, X., Miao, W., Meng, X.: An accurate method for mining top-k frequent pattern under differential privacy. J. Comput. Res. Develop. 51(1), 104–114 (2014)
[6] Bonomi, L., Xiong, L.: A two-phase algorithm for mining sequential patterns with differential privacy. In: ACM International Conference on Information & Knowledge Management, pp. 269–278. ACM (2013)
[7] Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
[8] McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: Foundations of Computer Science 2007, FOCS 2007, pp. 94–103. IEEE (2007)
[9] Guoqing, L., Xiaojian, Z., Liping, D.: Frequent sequential pattern mining under differential privacy. J. Comput. Res. Develop. 52(12), 2789–2801 (2015)
[10] Datasets. http://fimi.ua.ac.be/data/
Acknowledgments
This work is funded by the Chongqing Natural Science Foundation (cstc2014kjrc-qnrc40002) and the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1500431, KJ1400429).
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Xia, Y., Huang, Y., Zhang, X., Bae, H. (2018). Frequent Itemset Mining with Differential Privacy Based on Transaction Truncation. In: Qing, S., Mitchell, C., Chen, L., Liu, D. (eds) Information and Communications Security. ICICS 2017. Lecture Notes in Computer Science(), vol 10631. Springer, Cham. https://doi.org/10.1007/978-3-319-89500-0_38
DOI: https://doi.org/10.1007/978-3-319-89500-0_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-89499-7
Online ISBN: 978-3-319-89500-0