Abstract
This paper proposes a cell-suppression-based k-anonymization method that keeps the loss of utility minimal. The proposed method uses the Kullback-Leibler (KL) divergence as a utility measure, derived from notions developed in the literature on incomplete data analysis, including the missing-at-random (MAR) condition. More specifically, we plug the KL divergence, as an efficiently computable cost function, into a bottom-up, greedy procedure for local-recoding k-anonymization. We focus on classification datasets, and experimental results show that the proposed method yields only a small degradation of classification performance when combined with naive Bayes classifiers.
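As a rough illustration of the utility measure named in the abstract, the following sketch computes the KL divergence between two discrete empirical distributions. The distributions and names here are ours, for illustration only; the paper defines its cost over the anonymized dataset's empirical distributions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as dicts over the same support."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0)

# Toy empirical distributions over one attribute (illustrative values only)
p_hat = {"a": 0.5, "b": 0.3, "c": 0.2}   # before suppression
q_hat = {"a": 0.4, "b": 0.4, "c": 0.2}   # after suppression
cost = kl_divergence(p_hat, q_hat)
```

A smaller divergence means the suppressed data stays closer to the original distribution, i.e., less utility is lost.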
Notes
- 1.
Agglomerative clustering is a typical hierarchical clustering in which we start with initial clusters containing single tuples and merge the closest pair of clusters in a bottom-up manner [10].
- 2.
To be precise, the original definition by Harada et al. [7] does not consider classification datasets.
- 3.
To be precise, learning empirical probabilities using the pseudo count \(\alpha \), shown in Sect. 2.1, is called maximum a posteriori (MAP) estimation. ML estimation is a special case of MAP estimation where \(\alpha =0\). The following discussions can be easily extended to the case of MAP estimation.
- 4.
Joint distributions decomposed in this way are called selection models [17].
- 5.
Extending the discussion to the case with multiple i.i.d. (independent and identically distributed) tuples \(\{({{\varvec{y}}}^{(1)},c^{(1)}),({{\varvec{y}}}^{(2)},c^{(2)}),\ldots ,({{\varvec{y}}}^{(N)},c^{(N)})\}\) is fairly straightforward, since the likelihood can be transformed as \(L(\theta ,\phi )=\prod _i p({{\varvec{y}}}^{(i)},c^{(i)})=(\prod _i p({{\varvec{r}}}^{(i)}\mid {{\varvec{x}}}^{(i)},c^{(i)},\phi )) (\prod _i p({{\varvec{x}}}^{(i)},c^{(i)}\mid \theta ))\), where \({{\varvec{x}}}^{(i)}\) is the original of \({{\varvec{y}}}^{(i)}\).
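Note 1 above describes agglomerative clustering. A minimal sketch in plain Python (single-linkage distance on 1-D points; all names are ours) merges the closest pair of clusters bottom-up until k clusters remain:

```python
def agglomerate(points, k):
    """Agglomerative clustering: start from singletons, repeatedly merge the
    closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]              # one singleton per tuple
    def dist(a, b):                               # single-linkage distance
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)            # merge the closest pair
    return clusters
```

The O(n^3) pairwise search is the naive textbook form; practical implementations maintain a priority queue of inter-cluster distances.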
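Note 3 above relates MAP estimation with pseudo count \(\alpha\) to ML estimation (\(\alpha=0\)). A minimal sketch of the smoothed empirical probabilities, with illustrative names of our own:

```python
from collections import Counter

def map_estimate(values, domain, alpha):
    """Smoothed empirical probabilities p(v) = (N(v) + alpha) / (N + alpha * |domain|).
    Setting alpha = 0 recovers plain maximum-likelihood estimation."""
    counts = Counter(values)
    n = len(values)
    return {v: (counts[v] + alpha) / (n + alpha * len(domain)) for v in domain}
```

With \(\alpha>0\), unseen values receive non-zero probability, which keeps the KL divergence finite.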
References
Aggarwal, C.C.: Data Mining: The Textbook. Springer, Switzerland (2015)
Bayardo, R.J., Agrawal, R.: Data privacy through optimal \(k\)-anonymization. In: Proceedings of ICDE-05, pp. 217–228 (2005)
Dewri, R., Ray, I., Ray, I., Whitley, D.: On the optimal selection of \(k\) in the \(k\)-anonymity problem. In: Proceedings of ICDE-08, pp. 1364–1366 (2008)
Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29, 103–130 (1997)
Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. 42(4), 14:1–14:53 (2010)
Fung, B.C.M., Wang, K., Yu, P.S.: Anonymizing classification data for privacy preservation. IEEE Trans. Knowl. Data Eng. 19(5), 711–725 (2007)
Harada, K., Sato, Y., Togashi, Y.: Reducing amount of information loss in \(k\)-anonymization for secondary use of collected personal information. In: Proceedings of the 2012 Service Research and Innovation Institute Global Conference, pp. 61–69 (2012)
Heitjan, D.F.: Ignorability and coarse data. Ann. Stat. 19(4), 2244–2253 (1991)
Iyengar, V.: Transforming data to satisfy privacy constraints. In: Proceedings of KDD-02, pp. 279–288 (2002)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)
Kifer, D., Gehrke, J.: Injecting utility into anonymized datasets. In: Proceedings of SIGMOD-06, pp. 217–228 (2006)
Kisilevich, S., Rokach, L., Elovici, Y.: Efficient multidimensional suppression for \(k\)-anonymity. IEEE Trans. Knowl. Data Eng. 22(3), 334–347 (2010)
LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient full-domain \(k\)-anonymity. In: Proceedings of SIGMOD-05, pp. 49–60 (2005)
Meyerson, A., Williams, R.: On the complexity of optimal \(k\)-anonymity. In: Proceedings of PODS-04, pp. 223–228 (2004)
Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976)
Samarati, P.: Protecting respondents’ identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 670–682 (2001)
Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7, 147–177 (2002)
Sweeney, L.: Achieving \(k\)-anonymity privacy protection using generalization and suppression. Int. J. Uncertainty Fuzziness Knowl. Based Syst. 10(5), 571–588 (2002)
Wang, K., Yu, P.S., Chakraborty, S.: Bottom-up generalization: a data mining solution to privacy protection. In: Proceedings of ICDM-04, pp. 249–256 (2004)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A.W.C.: Utility-based anonymization using local recoding. In: Proceedings of KDD-06, pp. 785–790 (2006)
Xu, L., Jiang, C., Chen, Y., Wang, J., Ren, Y.: A framework for categorizing and applying privacy-preservation techniques in big data mining. Computer 49(2), 54–62 (2016)
Appendix: Derivation of the Proposed Suppression Cost
Here, we complete the derivation of the cost function \(\varGamma _\mathrm{mar}\) by showing how to obtain Eqs. 7 and 9. First, note that \(\hat{p}(c)=\hat{q}(c)\) holds, since the class label c is initially non-null and is never suppressed. Equation 7 is then derived as follows:


In Eqs. 13 and 14, we reordered the summations and moved irrelevant factors outside them wherever possible. Equation 15 finally follows from \(\sum _{x_{j'}}\hat{p}(x_{j'}\mid c)=1\), which holds since \(\hat{p}\) is a probability function.
On the other hand, for Eq. 9, we have been considering the specific case where the j-th non-null attribute value \(x_j\) of a tuple \(t=({{\varvec{y}}},c)\) is suppressed. As already mentioned, we have \(\hat{q}(x_j\mid c)=(N(x_j,c)+\alpha )/(N(\lnot \bot _j,c)+\alpha |\mathcal{V}_j|)\) and \(\hat{q}'(x_j\mid c)=(N(x_j,c)-N({{\varvec{y}}},c)+\alpha )/(N(\lnot \bot _j,c)-N({{\varvec{y}}},c)+\alpha |\mathcal{V}_j|)\), and additionally, for each value \(x'_j\) of the j-th attribute that is not suppressed this time (i.e. \(x'_j\ne x_j\)), we have \(\hat{q}'(x'_j\mid c)=(N(x'_j,c)+\alpha )/(N(\lnot \bot _j,c)-N({{\varvec{y}}},c)+\alpha |\mathcal{V}_j|)\). Substituting these probabilities into Eq. 8 results in Eq. 9 as follows:
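The count-based formulas for \(\hat{q}\) and \(\hat{q}'\) quoted above can be sketched directly in code; the function and parameter names here are ours, not the paper's. One can also check numerically that the updated conditionals still sum to one over the attribute's domain:

```python
def q_before(n_xj_c, n_nonnull_c, alpha, vj_size):
    # q(x_j | c) = (N(x_j,c) + alpha) / (N(non-null_j, c) + alpha * |V_j|)
    return (n_xj_c + alpha) / (n_nonnull_c + alpha * vj_size)

def q_after(n_val_c, n_nonnull_c, n_y_c, alpha, vj_size, suppressed):
    # After suppressing x_j in all N(y,c) copies of tuple t, the count of the
    # suppressed value and the non-null total both drop by N(y,c).
    num = (n_val_c - n_y_c) if suppressed else n_val_c
    return (num + alpha) / (n_nonnull_c - n_y_c + alpha * vj_size)

# Example counts: domain V_j = {a, b, c}, N(a,c)=5, N(b,c)=3, N(c,c)=2,
# N(non-null_j, c) = 10, and N(y,c) = 2 copies of the tuple whose value "a"
# is suppressed (all counts illustrative).
counts = {"a": 5, "b": 3, "c": 2}
total = sum(q_after(counts[v], 10, 2, 1.0, 3, v == "a") for v in counts)
```

The sum equals one because the numerators lose exactly \(N({{\varvec{y}}},c)\) in total, matching the drop in the denominator.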

© 2016 Springer International Publishing Switzerland
Kameya, Y., Hayashi, K. (2016). Bottom-Up Cell Suppression that Preserves the Missing-at-random Condition. In: Katsikas, S., Lambrinoudakis, C., Furnell, S. (eds) Trust, Privacy and Security in Digital Business. TrustBus 2016. Lecture Notes in Computer Science(), vol 9830. Springer, Cham. https://doi.org/10.1007/978-3-319-44341-6_5
Print ISBN: 978-3-319-44340-9
Online ISBN: 978-3-319-44341-6