Abstract
This paper proposes a cell-suppression-based k-anonymization method that keeps the loss of utility minimal. The proposed method uses the Kullback-Leibler (KL) divergence as a utility measure, derived from notions developed in the literature on incomplete data analysis, including the missing-at-random (MAR) condition. More specifically, we plug the KL divergence, as an efficiently computable cost function, into a bottom-up, greedy procedure for local-recoding k-anonymization. We focus on classification datasets, and experimental results show that the proposed method yields only a small degradation of classification performance when combined with naive Bayes classifiers.
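As a rough illustration of the utility measure named in the abstract, the following sketch computes the KL divergence between two discrete empirical distributions. The distributions and names here are ours, for illustration only; the paper defines its cost over the anonymized dataset's empirical distributions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as dicts over the same support."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0)

# Toy empirical distributions over one attribute (illustrative values only)
p_hat = {"a": 0.5, "b": 0.3, "c": 0.2}   # before suppression
q_hat = {"a": 0.4, "b": 0.4, "c": 0.2}   # after suppression
cost = kl_divergence(p_hat, q_hat)
```

A smaller divergence means the suppressed data stays closer to the original distribution, i.e., less utility is lost.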
Notes
- 1.
Agglomerative clustering is a typical hierarchical clustering in which we start with initial clusters containing single tuples and merge the closest pair of clusters in a bottom-up manner [10].
- 2.
To be precise, the original definition by Harada et al. [7] does not consider classification datasets.
- 3.
To be precise, learning empirical probabilities using the pseudo count \(\alpha \), shown in Sect. 2.1, is called maximum a posteriori (MAP) estimation. ML estimation is a special case of MAP estimation where \(\alpha =0\). The following discussions can be easily extended to the case of MAP estimation.
- 4.
Joint distributions decomposed in this way are called selection models [17].
- 5.
Extending the discussion to the case with multiple i.i.d. (independent and identically distributed) tuples \(\{({{\varvec{y}}}^{(1)},c^{(1)}),({{\varvec{y}}}^{(2)},c^{(2)}),\ldots ,({{\varvec{y}}}^{(N)},c^{(N)})\}\) is fairly straightforward, since the likelihood can be transformed as \(L(\theta ,\phi )=\prod _i p({{\varvec{y}}}^{(i)},c^{(i)})=(\prod _i p({{\varvec{r}}}^{(i)}\mid {{\varvec{x}}}^{(i)},c^{(i)},\phi )) (\prod _i p({{\varvec{x}}}^{(i)},c^{(i)}\mid \theta ))\), where \({{\varvec{x}}}^{(i)}\) is the original of \({{\varvec{y}}}^{(i)}\).
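Note 1 above describes agglomerative clustering. A minimal sketch in plain Python (single-linkage distance on 1-D points; all names are ours) merges the closest pair of clusters bottom-up until k clusters remain:

```python
def agglomerate(points, k):
    """Agglomerative clustering: start from singletons, repeatedly merge the
    closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]              # one singleton per tuple
    def dist(a, b):                               # single-linkage distance
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)            # merge the closest pair
    return clusters
```

The O(n^3) pairwise search is the naive textbook form; practical implementations maintain a priority queue of inter-cluster distances.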
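Note 3 above relates MAP estimation with pseudo count \(\alpha\) to ML estimation (\(\alpha=0\)). A minimal sketch of the smoothed empirical probabilities, with illustrative names of our own:

```python
from collections import Counter

def map_estimate(values, domain, alpha):
    """Smoothed empirical probabilities p(v) = (N(v) + alpha) / (N + alpha * |domain|).
    Setting alpha = 0 recovers plain maximum-likelihood estimation."""
    counts = Counter(values)
    n = len(values)
    return {v: (counts[v] + alpha) / (n + alpha * len(domain)) for v in domain}
```

With \(\alpha>0\), unseen values receive non-zero probability, which keeps the KL divergence finite.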
References
Aggarwal, C.C.: Data Mining: The Textbook. Springer, Switzerland (2015)
Bayardo, R.J., Agrawal, R.: Data privacy through optimal \(k\)-anonymization. In: Proceedings of ICDE-05, pp. 217–228 (2005)
Dewri, R., Ray, I., Ray, I., Whitley, D.: On the optimal selection of \(k\) in the \(k\)-anonymity problem. In: Proceedings of ICDE-08, pp. 1364–1366 (2008)
Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29, 103–130 (1997)
Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. 42(4), 14:1–14:53 (2010)
Fung, B.C.M., Wang, K., Yu, P.S.: Anonymizing classification data for privacy preservation. IEEE Trans. Knowl. Data Eng. 19(5), 711–725 (2007)
Harada, K., Sato, Y., Togashi, Y.: Reducing amount of information loss in \(k\)-anonymization for secondary use of collected personal information. In: Proceedings of the 2012 Service Research and Innovation Institute Global Conference, pp. 61–69 (2012)
Heitjan, D.F.: Ignorability and coarse data. Ann. Stat. 19(4), 2244–2253 (1991)
Iyengar, V.: Transforming data to satisfy privacy constraints. In: Proceedings of KDD-02, pp. 279–288 (2002)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)
Kifer, D., Gehrke, J.: Injecting utility into anonymized datasets. In: Proceedings of SIGMOD-06, pp. 217–228 (2006)
Kisilevich, S., Rokach, L., Elovici, Y.: Efficient multidimensional suppression for \(k\)-anonymity. IEEE Trans. Knowl. Data Eng. 22(3), 334–347 (2010)
LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient full-domain \(k\)-anonymity. In: Proceedings of SIGMOD-05, pp. 49–60 (2005)
Meyerson, A., Williams, R.: On the complexity of optimal \(k\)-anonymity. In: Proceedings of PODS-04, pp. 223–228 (2004)
Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976)
Samarati, P.: Protecting respondents’ identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 670–682 (2001)
Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7, 147–177 (2002)
Sweeney, L.: Achieving \(k\)-anonymity privacy protection using generalization and suppression. Int. J. Uncertainty Fuzziness Knowl. Based Syst. 10(5), 571–588 (2002)
Wang, K., Yu, P.S., Chakraborty, S.: Bottom-up generalization: a data mining solution to privacy protection. In: Proceedings of ICDM-04, pp. 249–256 (2004)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A.W.C.: Utility-based anonymization using local recoding. In: Proceedings of KDD-06, pp. 785–790 (2006)
Xu, L., Jiang, C., Chen, Y., Wang, J., Ren, Y.: A framework for categorizing and applying privacy-preservation techniques in big data mining. Computer 49(2), 54–62 (2016)
Appendix: Derivation of the Proposed Suppression Cost
Here, we complete the derivation of the cost function \(\varGamma _\mathrm{mar}\) by showing how to obtain Eqs. 7 and 9. First, note that \(\hat{p}(c)=\hat{q}(c)\) holds, since the class label c is initially non-null and is never suppressed. Equation 7 is then derived as follows:


In Eqs. 13 and 14, we reordered the summations and moved irrelevant factors outside them wherever possible. Equation 15 finally follows from \(\sum _{x_{j'}}\hat{p}(x_{j'}\mid c)=1\), which holds since \(\hat{p}\) is a probability function.
On the other hand, for Eq. 9, we have been considering the specific case where the j-th non-null attribute value \(x_j\) of a tuple \(t=({{\varvec{y}}},c)\) is suppressed. As already mentioned, we have \(\hat{q}(x_j\mid c)=(N(x_j,c)+\alpha )/(N(\lnot \bot _j,c)+\alpha |\mathcal{V}_j|)\) and \(\hat{q}'(x_j\mid c)=(N(x_j,c)-N({{\varvec{y}}},c)+\alpha )/(N(\lnot \bot _j,c)-N({{\varvec{y}}},c)+\alpha |\mathcal{V}_j|)\), and additionally, for each value \(x'_j\) of the j-th attribute that is not suppressed this time (i.e. \(x'_j\ne x_j\)), we have \(\hat{q}'(x'_j\mid c)=(N(x'_j,c)+\alpha )/(N(\lnot \bot _j,c)-N({{\varvec{y}}},c)+\alpha |\mathcal{V}_j|)\). Substituting these probabilities into Eq. 8 results in Eq. 9 as follows:
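The count-based formulas for \(\hat{q}\) and \(\hat{q}'\) quoted above can be sketched directly in code; the function and parameter names here are ours, not the paper's. One can also check numerically that the updated conditionals still sum to one over the attribute's domain:

```python
def q_before(n_xj_c, n_nonnull_c, alpha, vj_size):
    # q(x_j | c) = (N(x_j,c) + alpha) / (N(non-null_j, c) + alpha * |V_j|)
    return (n_xj_c + alpha) / (n_nonnull_c + alpha * vj_size)

def q_after(n_val_c, n_nonnull_c, n_y_c, alpha, vj_size, suppressed):
    # After suppressing x_j in all N(y,c) copies of tuple t, the count of the
    # suppressed value and the non-null total both drop by N(y,c).
    num = (n_val_c - n_y_c) if suppressed else n_val_c
    return (num + alpha) / (n_nonnull_c - n_y_c + alpha * vj_size)

# Example counts: domain V_j = {a, b, c}, N(a,c)=5, N(b,c)=3, N(c,c)=2,
# N(non-null_j, c) = 10, and N(y,c) = 2 copies of the tuple whose value "a"
# is suppressed (all counts illustrative).
counts = {"a": 5, "b": 3, "c": 2}
total = sum(q_after(counts[v], 10, 2, 1.0, 3, v == "a") for v in counts)
```

The sum equals one because the numerators lose exactly \(N({{\varvec{y}}},c)\) in total, matching the drop in the denominator.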

© 2016 Springer International Publishing Switzerland
Kameya, Y., Hayashi, K. (2016). Bottom-Up Cell Suppression that Preserves the Missing-at-random Condition. In: Katsikas, S., Lambrinoudakis, C., Furnell, S. (eds) Trust, Privacy and Security in Digital Business. TrustBus 2016. Lecture Notes in Computer Science(), vol 9830. Springer, Cham. https://doi.org/10.1007/978-3-319-44341-6_5
Print ISBN: 978-3-319-44340-9
Online ISBN: 978-3-319-44341-6