
Bottom-Up Cell Suppression that Preserves the Missing-at-random Condition

  • Conference paper
  • In: Trust, Privacy and Security in Digital Business (TrustBus 2016)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 9830)


Abstract

This paper proposes a cell-suppression-based k-anonymization method that keeps the loss of utility minimal. The proposed method uses the Kullback-Leibler (KL) divergence as a utility measure, derived from notions developed in the literature on incomplete data analysis, including the missing-at-random (MAR) condition. More specifically, we plug the KL divergence, which can be computed efficiently, into a bottom-up, greedy procedure for local-recoding k-anonymization as its cost function. We focus on classification datasets, and experimental results show that the proposed method yields only a small degradation of classification performance when combined with naive Bayes classifiers.
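The algorithm itself is specified in the body of the paper and is not reproduced on this page. As a rough, purely illustrative sketch of the kind of greedy, bottom-up suppression loop the abstract describes (hypothetical names, one cell suppressed at a time, and a simplified per-attribute KL cost), one might write something like the following:

```python
# Illustrative sketch only (not the authors' code): a greedy, bottom-up
# cell-suppression loop that repeatedly suppresses the attribute value whose
# removal costs least under a KL-divergence-style utility measure.
from collections import Counter
import math

BOT = None  # stands in for the null symbol (bottom) used for suppressed cells

def kl_suppression_cost(records, j, value, cls, alpha=1.0):
    """Approximate utility loss of suppressing one occurrence of `value` in
    attribute j within class `cls`: its contribution to KL(p_hat || q_hat),
    with pseudo-count alpha (a MAP-style smoothed estimate)."""
    col = [x[j] for x, c in records if c == cls and x[j] is not BOT]
    domain = len({x[j] for x, _ in records if x[j] is not BOT}) or 1
    n = len(col)
    n_v = Counter(col)[value]
    p = (n_v + alpha) / (n + alpha * domain)          # estimate before suppression
    q = (n_v - 1 + alpha) / (n - 1 + alpha * domain)  # estimate after suppression
    return p * math.log(p / q)

def greedy_suppress(records, quasi_ids, k, alpha=1.0):
    """records: list of (attributes_as_list, class_label) pairs."""
    def unsafe():
        groups = Counter(tuple(x[j] for j in quasi_ids) for x, _ in records)
        return any(count < k for count in groups.values())

    while unsafe():
        best = None
        for i, (x, c) in enumerate(records):
            for j in quasi_ids:
                if x[j] is BOT:
                    continue
                cost = kl_suppression_cost(records, j, x[j], c, alpha)
                if best is None or cost < best[2]:
                    best = (i, j, cost)
        if best is None:            # every quasi-identifier cell is already null
            break
        i, j, _ = best
        records[i][0][j] = BOT      # suppress the cheapest cell
    return records
```

The actual method performs local-recoding suppression with the cost function derived in the appendix; the sketch above only conveys the overall greedy structure.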


Notes

  1. Agglomerative clustering is a typical hierarchical clustering procedure in which we start with initial clusters containing single tuples and merge the closest pair of clusters in a bottom-up manner [10] (a minimal sketch is given after these notes).

  2. To be precise, the original definition by Harada et al. [7] does not consider classification datasets.

  3. To be precise, learning empirical probabilities using the pseudo count \(\alpha \), shown in Sect. 2.1, is called maximum a posteriori (MAP) estimation. ML estimation is a special case of MAP estimation where \(\alpha =0\). The following discussions can easily be extended to the case of MAP estimation.

  4. Joint distributions decomposed in this way are called selection models [17].

  5. Extending the discussion to the case with multiple i.i.d. (independent and identically distributed) tuples \(\{({{\varvec{y}}}^{(1)},c^{(1)}),({{\varvec{y}}}^{(2)},c^{(2)}),\ldots ,({{\varvec{y}}}^{(N)},c^{(N)})\}\) is fairly straightforward, since the likelihood can be transformed as \(L(\theta ,\phi )=\prod _i p({{\varvec{y}}}^{(i)},c^{(i)})=(\prod _i p({{\varvec{r}}}^{(i)}\mid {{\varvec{x}}}^{(i)},c^{(i)},\phi )) (\prod _i p({{\varvec{x}}}^{(i)},c^{(i)}\mid \theta ))\), where \({{\varvec{x}}}^{(i)}\) is the original of \({{\varvec{y}}}^{(i)}\).
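Note 1 describes agglomerative clustering only in general terms. A minimal, generic sketch (not the paper's algorithm; the names and the distance function are illustrative only) looks like this:

```python
# Generic agglomerative (bottom-up hierarchical) clustering as in note 1:
# start from singleton clusters and repeatedly merge the closest pair.
def agglomerative(points, k, dist):
    clusters = [[p] for p in points]            # singleton initial clusters
    while len(clusters) > k:
        # pick the closest pair of clusters (single-linkage distance here)
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist(x, y) for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))     # merge the pair
    return clusters

# Example usage with 1-D points and absolute-difference distance.
print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 2, lambda a, b: abs(a - b)))
```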

References

  1. Aggarwal, C.C.: Data Mining: The Textbook. Springer, Switzerland (2015)

  2. Bayardo, R.J., Agrawal, R.: Data privacy through optimal \(k\)-anonymization. In: Proceedings of ICDE-05, pp. 217–228 (2005)

  3. Dewri, R., Ray, I., Ray, I., Whitley, D.: On the optimal selection of \(k\) in the \(k\)-anonymity problem. In: Proceedings of ICDE-08, pp. 1364–1366 (2008)

  4. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29, 103–130 (1997)

  5. Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. 42(4), 14:1–14:53 (2010)

  6. Fung, B.C.M., Wang, K., Yu, P.S.: Anonymizing classification data for privacy preservation. IEEE Trans. Knowl. Data Eng. 19(5), 711–725 (2007)

  7. Harada, K., Sato, Y., Togashi, Y.: Reducing amount of information loss in \(k\)-anonymization for secondary use of collected personal information. In: Proceedings of the 2012 Service Research and Innovation Institute Global Conference, pp. 61–69 (2012)

  8. Heitjan, D.F.: Ignorability and coarse data. Ann. Stat. 19(4), 2244–2253 (1991)

  9. Iyengar, V.: Transforming data to satisfy privacy constraints. In: Proceedings of KDD-02, pp. 279–288 (2002)

  10. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)

  11. Kifer, D., Gehrke, J.: Injecting utility into anonymized datasets. In: Proceedings of SIGMOD-06, pp. 217–228 (2006)

  12. Kisilevich, S., Rokach, L., Elovici, Y.: Efficient multidimensional suppression for \(k\)-anonymity. IEEE Trans. Knowl. Data Eng. 22(3), 334–347 (2010)

  13. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient full-domain \(k\)-anonymity. In: Proceedings of SIGMOD-05, pp. 49–60 (2005)

  14. Meyerson, A., Williams, R.: On the complexity of optimal \(k\)-anonymity. In: Proceedings of PODS-04, pp. 223–228 (2004)

  15. Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976)

  16. Samarati, P.: Protecting respondents' identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 670–682 (2001)

  17. Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7, 147–177 (2002)

  18. Sweeney, L.: Achieving \(k\)-anonymity privacy protection using generalization and suppression. Int. J. Uncertainty Fuzziness Knowl. Based Syst. 10(5), 571–588 (2002)

  19. Wang, K., Yu, P.S., Chakraborty, S.: Bottom-up generalization: a data mining solution to privacy protection. In: Proceedings of ICDM-04, pp. 249–256 (2004)

  20. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

  21. Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A.W.C.: Utility-based anonymization using local recoding. In: Proceedings of KDD-06, pp. 785–790 (2006)

  22. Xu, L., Jiang, C., Chen, Y., Wang, J., Ren, Y.: A framework for categorizing and applying privacy-preservation techniques in big data mining. Computer 49(2), 54–62 (2016)


Author information

Corresponding author: Yoshitaka Kameya

Appendix: Derivation of the Proposed Suppression Cost

Here, we complete the derivation of the cost function \(\varGamma _\mathrm{mar}\) by showing how to obtain Eqs. 7 and 9. First, note that \(\hat{p}(c)=\hat{q}(c)\) holds, since the class label c is initially non-null and will never be suppressed. Equation 7 is then derived as follows:

$$\begin{aligned} \mathrm{KL}(\hat{p}\,\Vert \,\hat{q})= & {} \sum _c\sum _{x_1,\ldots ,x_M} \hat{p}(c)\prod _{j=1}^M\hat{p}(x_j\mid c)\,\log \frac{\hat{p}(c)\prod _{j=1}^M\hat{p}(x_j\mid c)}{\hat{q}(c)\prod _{j=1}^M\hat{q}(x_j\mid c)} \end{aligned}$$
(13)
$$\begin{aligned}= & {} \sum _c \hat{p}(c)\sum _{x_1,\ldots ,x_M} \Bigl (\prod _{j'=1}^M\hat{p}(x_{j'}\mid c)\Bigr )\sum _{j=1}^M \log \frac{\hat{p}(x_j\mid c)}{\hat{q}(x_j\mid c)} \end{aligned}$$
(14)
$$\begin{aligned}= & {} \sum _c \hat{p}(c) \sum _{j=1}^M \sum _{x_j}\hat{p}(x_j\mid c)\log \frac{\hat{p}(x_j\mid c)}{\hat{q}(x_j\mid c)} \Bigl (\prod _{j'=1}^{j-1}\sum _{x_{j'}}\hat{p}(x_{j'}\mid c)\Bigr )\Bigl (\prod _{j'=j+1}^M\sum _{x_{j'}}\hat{p}(x_{j'}\mid c)\Bigr ) \nonumber \\= & {} \sum _c \hat{p}(c) \sum _{j=1}^M \sum _{x_j}\hat{p}(x_j\mid c)\log \frac{\hat{p}(x_j\mid c)}{\hat{q}(x_j\mid c)}. \end{aligned}$$
(15)

In Eqs. 13 and 14, we carefully reordered summations and moved irrelevant factors outside the summations wherever possible. Equation 15 was finally derived using \(\sum _{x_{j'}}\hat{p}(x_{j'}\mid c)=1\) since \(\hat{p}\) is a probability function.
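The identity established by Eqs. 13–15 can also be checked numerically. The following toy example (made-up probabilities, not data from the paper) verifies that, under the naive Bayes factorization with \(\hat{p}(c)=\hat{q}(c)\), the KL divergence between the joint distributions equals the attribute-wise sum of Eq. 15:

```python
# Toy numerical check of Eqs. 13-15: with a naive Bayes factorization and
# p_hat(c) = q_hat(c), the joint KL divergence equals the sum of per-attribute
# conditional KL divergences weighted by p_hat(c). Numbers are invented.
import itertools
import math

classes = {"pos": 0.6, "neg": 0.4}                      # p_hat(c) = q_hat(c)
p = {  # p_hat(x_j | c) for M = 2 attributes
    "pos": [{"a": 0.7, "b": 0.3}, {"u": 0.5, "v": 0.5}],
    "neg": [{"a": 0.2, "b": 0.8}, {"u": 0.9, "v": 0.1}],
}
q = {  # q_hat(x_j | c), e.g. after some suppression
    "pos": [{"a": 0.6, "b": 0.4}, {"u": 0.5, "v": 0.5}],
    "neg": [{"a": 0.3, "b": 0.7}, {"u": 0.8, "v": 0.2}],
}

# Left-hand side (Eq. 13): KL between the full joints p_hat(x, c) and q_hat(x, c).
lhs = 0.0
for c, pc in classes.items():
    for xs in itertools.product(*[d.keys() for d in p[c]]):
        pj = pc * math.prod(p[c][j][x] for j, x in enumerate(xs))
        qj = pc * math.prod(q[c][j][x] for j, x in enumerate(xs))
        lhs += pj * math.log(pj / qj)

# Right-hand side (Eq. 15): weighted sum of attribute-wise conditional KLs.
rhs = sum(
    pc * p[c][j][x] * math.log(p[c][j][x] / q[c][j][x])
    for c, pc in classes.items()
    for j in range(2)
    for x in p[c][j]
)

assert abs(lhs - rhs) < 1e-9
print(lhs, rhs)
```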

On the other hand, for Eq. 9, we have been considering the specific case where the j-th non-null attribute value \(x_j\) of a tuple \(t=({{\varvec{y}}},c)\) is suppressed. As already mentioned, we have \(\hat{q}(x_j\mid c)=(N(x_j,c)+\alpha )/(N(\lnot \bot _j,c)+\alpha |\mathcal{V}_j|)\) and \(\hat{q}'(x_j\mid c)=(N(x_j,c)-N({{\varvec{y}}},c)+\alpha )/(N(\lnot \bot _j,c)-N({{\varvec{y}}},c)+\alpha |\mathcal{V}_j|)\). Additionally, for each value \(x'_j\) of the j-th attribute that is not suppressed this time (i.e. \(x'_j\ne x_j\)), we have \(\hat{q}'(x'_j\mid c)=(N(x'_j,c)+\alpha )/(N(\lnot \bot _j,c)-N({{\varvec{y}}},c)+\alpha |\mathcal{V}_j|)\). Substituting these probabilities into Eq. 8 yields Eq. 9.

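As a hypothetical illustration of the count bookkeeping behind these estimates (function names and numbers are invented, not the authors' code), the update from \(\hat{q}\) to \(\hat{q}'\) can be written as:

```python
# Sketch of the count bookkeeping quoted above: suppressing the j-th value x_j
# of the N(y, c) matching copies of a tuple (y, c) lowers N(x_j, c) and
# N(not-bot_j, c) by N(y, c), giving q_hat'(x_j | c); for any other value x'_j
# only the denominator changes.
def q_hat(n_xj_c, n_nonnull_j_c, domain_size_j, alpha):
    """q_hat(x_j | c) = (N(x_j, c) + alpha) / (N(not-bot_j, c) + alpha * |V_j|)."""
    return (n_xj_c + alpha) / (n_nonnull_j_c + alpha * domain_size_j)

def q_hat_after_suppression(n_xj_c, n_nonnull_j_c, domain_size_j, alpha,
                            n_y_c, suppressed_value_is_xj):
    """Estimate after suppressing attribute j in the N(y, c) matching tuples."""
    num = n_xj_c - n_y_c if suppressed_value_is_xj else n_xj_c
    den = n_nonnull_j_c - n_y_c
    return (num + alpha) / (den + alpha * domain_size_j)

# Example with made-up counts: N(x_j, c) = 30, N(not-bot_j, c) = 100,
# |V_j| = 4, alpha = 1, and N(y, c) = 5 tuples being suppressed.
before      = q_hat(30, 100, 4, 1.0)
after_same  = q_hat_after_suppression(30, 100, 4, 1.0, 5, True)   # the value x_j itself
after_other = q_hat_after_suppression(12, 100, 4, 1.0, 5, False)  # some other value x'_j
print(before, after_same, after_other)
```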


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kameya, Y., Hayashi, K. (2016). Bottom-Up Cell Suppression that Preserves the Missing-at-random Condition. In: Katsikas, S., Lambrinoudakis, C., Furnell, S. (eds) Trust, Privacy and Security in Digital Business. TrustBus 2016. Lecture Notes in Computer Science, vol. 9830. Springer, Cham. https://doi.org/10.1007/978-3-319-44341-6_5


  • DOI: https://doi.org/10.1007/978-3-319-44341-6_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44340-9

  • Online ISBN: 978-3-319-44341-6

  • eBook Packages: Computer Science; Computer Science (R0)
