Abstract
In this paper, we study the problem of mining frequent itemsets in high-dimensional databases under differential privacy, and propose a novel algorithm, PrivBUD-Wise, which achieves both high result utility and a high privacy level. Instead of limiting the cardinality of transactions by truncating or splitting them, which causes extra information loss and results in unsatisfactory utility, PrivBUD-Wise performs no preprocessing on the original database and guarantees high result utility by reducing the \(privacy\ budget\) spent on irrelevant itemsets as much as possible. To achieve this, we first propose SRNM, a Report Noisy mechanism with an optional number of reported itemsets, and, more importantly, we give a rigorous proof for SRNM in the appendix. Moreover, PrivBUD-Wise is the first to propose a biased \(privacy\ budget\) allocation strategy, and it requires no assumption or estimate of the maximal cardinality. Experiments on three real-world datasets demonstrate the utility and efficiency of PrivBUD-Wise.
Acknowledgement
This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61772491 and No. U170921, the Natural Science Foundation of Jiangsu Province under Grant No. BK20161256, and the Anhui Initiative in Quantum Information Technologies AHY150300.
Appendices
Appendix A
Lemma 4
For any \(\delta >0\), if x is a draw from \(Lap(b)\), then:
where \(\mathrm {P}\) denotes the probability.
Proof
Hence, this lemma follows.
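For reference, one standard Laplace tail bound that fits the way Lemma 4 is invoked in Appendix B can be written as below. The threshold \(t\) is a symbol introduced here purely for illustration, and the exact statement used by the authors may differ.

\[
\mathrm{P}[x \ge t+\delta] \;\ge\; e^{-\delta /b}\,\mathrm{P}[x \ge t] \quad \text{for every real } t .
\]

A short justification under these assumptions: the density of \(Lap(b)\) is \(f(u)=\frac{1}{2b}e^{-|u|/b}\), and \(|u+\delta |\le |u|+\delta \) gives \(f(u+\delta )\ge e^{-\delta /b}f(u)\), so

\[
\mathrm{P}[x \ge t+\delta] = \int_{t}^{\infty} f(u+\delta )\,du \;\ge\; e^{-\delta /b}\int_{t}^{\infty} f(u)\,du = e^{-\delta /b}\,\mathrm{P}[x \ge t].
\]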
Appendix B
Proof of Theorem 2: Fix \(D=D'\cup \{t\}\), where t is a transaction. Let v, respectively \(v'\), denote the vector of query counts of SRNM when the dataset is D, respectively \(D'\). We use m to denote the number of queries (equal to the number of candidate itemsets). Then we have:
(1) \(v_{i}\ge v^{'}_{i}\) for \(\forall i\in [m]\);
(2) \(1+v^{'}_{i}\ge v_{i}\) for \(\forall i\in [m]\).
Given an integer z, for every \(z'\in [z]\), fix any set \(j=(j_{1},j_{2}, \ldots ,j_{z'})\in [m]^{z'}\). To prove differential privacy, we want to bound the ratio (from above and below) of the probabilities that \((j_{1},j_{2}, \ldots ,j_{z'})\) is selected with D and with \(D'\).
Fix \(r_{-j}\), which is a draw from \([Lap(z/\epsilon )]^{m-z'}\) and is used for all noisy query counts except the \(z'\) counts corresponding to \(j=(j_{1},j_{2}, \ldots ,j_{z'})\). We use \(\mathrm {P}[j|\theta ]\) to denote the probability that the output of SRNM is j under condition \(\theta \).
Firstly, we prove that \(\mathrm {P}[j|D,r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D',r_{-j}]\). For every \(k\in j\), define: \(r_{k}^{*}=\mathrm {min}_{r_{k}}:v_{k}+r_{k} > v_{i}+r_{i}, \forall i \in [m]\backslash j\).
Then j is the output with D iff for \(\forall k\in j\): \(r_{k}\ge r_{k}^{*}\).
For all \(i\in [m]\backslash j, k\in j\):
So, if for \(\forall k\in j\): \(r_{k}\ge r_{k}^{*}+1\), then the output with \(D'\) will be j and the added noise will be \((r_{j},r_{-j})\). So we have:
The second equality is due to Lemma 4. Multiplying by \(e^{\epsilon }\) gives \(\mathrm {P}[j|D,r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D',r_{-j}]\).
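The two displayed steps of this direction can be sketched as follows, in the style of the usual Report Noisy Max argument. The sketch takes \(r_{k}^{*}\) to be the smallest noise value for which \(v_{k}+r_{k}\) exceeds \(v_{i}+r_{i}\) for all \(i\in [m]\backslash j\) (mirroring the definition given for \(D'\)), and it assumes that the noise terms are independent draws from \(Lap(z/\epsilon )\) and that Lemma 4 bounds each shifted tail by a factor \(e^{-\epsilon /z}\). These are assumptions, not statements taken from the text.

Using properties (1) and (2), for all \(i\in [m]\backslash j\) and \(k\in j\):

\[
v_{k}+r_{k}^{*} > v_{i}+r_{i} \;\Rightarrow\; (1+v^{'}_{k})+r_{k}^{*} \ge v_{k}+r_{k}^{*} > v_{i}+r_{i} \ge v^{'}_{i}+r_{i} \;\Rightarrow\; v^{'}_{k}+(r_{k}^{*}+1) > v^{'}_{i}+r_{i}.
\]

With \(r_{-j}\) fixed, the thresholds \(r_{k}^{*}\) are constants, so the events for different \(k\in j\) are independent and, since \(z'\le z\):

\[
\mathrm {P}[j|D',r_{-j}] \;\ge\; \prod _{k\in j}\mathrm {P}[r_{k}\ge r_{k}^{*}+1] \;\ge\; e^{-z'\epsilon /z}\prod _{k\in j}\mathrm {P}[r_{k}\ge r_{k}^{*}] \;\ge\; e^{-\epsilon }\,\mathrm {P}[j|D,r_{-j}].
\]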
We now prove that \(\mathrm {P}[j|D',r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D,r_{-j}]\). For every \(k\in j\), define: \(r_{k}^{*}=\mathrm {min}_{r_{k}}:v_{k}^{'}+r_{k} > v_{i}^{'}+r_{i}, \forall i \in [m]\backslash j\). Then j is the output when the dataset is \(D'\) iff for \(\forall k\in j\): \(r_{k}\ge r_{k}^{*}\).
For all \(i\in [m]\backslash j, k\in j\):
So, if for \(\forall k\in j\): \(r_{k}\ge r_{k}^{*}+1\), then the output with D will be j and the added noise will be \((r_{j},r_{-j})\). So we have:
Multiplying by \(e^{\epsilon }\) gives \(\mathrm {P}[j|D',r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D,r_{-j}]\). Hence the theorem follows.
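To make the object of this proof concrete, the following is a minimal Python sketch of a Report-Noisy-Max-style selection that adds \(Lap(z/\epsilon )\) noise to each candidate count and reports the \(z'\) indices with the largest noisy counts. It is an illustration under stated assumptions, not the authors' SRNM: the function names `laplace_noise` and `srnm_sketch`, the way \(z'\) is supplied by the caller, and the tie-breaking are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def srnm_sketch(counts, z, z_prime, epsilon, rng=None):
    """Hypothetical sketch: return the z_prime indices with the largest noisy counts.

    counts  -- true support counts of the m candidate itemsets
    z       -- upper bound on the number of reported itemsets
    z_prime -- number of itemsets actually reported (assumed given, 1 <= z_prime <= z)
    epsilon -- privacy parameter; each count receives Lap(z / epsilon) noise
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 1 <= z_prime <= z:
        raise ValueError("z_prime must lie in [1, z]")
    if z_prime > len(counts):
        raise ValueError("cannot report more itemsets than there are candidates")
    rng = rng if rng is not None else random.Random()

    scale = z / epsilon  # noise scale taken from the proof: Lap(z / epsilon)
    noisy = [c + laplace_noise(scale, rng) for c in counts]

    # Rank candidates by noisy count and keep the top z_prime indices.
    ranked = sorted(range(len(counts)), key=lambda i: noisy[i], reverse=True)
    return sorted(ranked[:z_prime])


if __name__ == "__main__":
    supports = [50, 10, 40, 5]  # made-up counts for illustration only
    print(srnm_sketch(supports, z=3, z_prime=2, epsilon=1.0, rng=random.Random(7)))
```

Only the selected indices are returned, not the noisy counts themselves, which matches the quantity \(\mathrm {P}[j|\theta ]\) bounded in the proof.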
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, J., Han, K., Song, P., Xu, C., Gui, F. (2019). PrivBUD-Wise: Differentially Private Frequent Itemsets Mining in High-Dimensional Databases. In: Shao, J., Yiu, M., Toyoda, M., Zhang, D., Wang, W., Cui, B. (eds) Web and Big Data. APWeb-WAIM 2019. Lecture Notes in Computer Science, vol 11641. Springer, Cham. https://doi.org/10.1007/978-3-030-26072-9_8
DOI: https://doi.org/10.1007/978-3-030-26072-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26071-2
Online ISBN: 978-3-030-26072-9
eBook Packages: Computer Science, Computer Science (R0)