Skip to main content
Log in

Multivariate microaggregation by iterative optimization

  • Published:
Applied Intelligence Aims and scope Submit manuscript

Abstract

Microaggregation is a well-known perturbative approach to publish personal or financial records while preserving the privacy of data subjects. Microaggregation is also a mechanism to realize the k-anonymity model for Privacy Preserving Data Publishing (PPDP). Microaggregation consists of two successive phases: partitioning the underlying records into small clusters with at least k records and aggregating the clustered records by a special kind of cluster statistic as a replacement. Optimal multivariate microaggregation has been shown to be NP-hard. Several heuristic approaches have been proposed in the literature. This paper presents an iterative optimization method based on the optimal solution of the microaggregation problem (IMHM). The method builds the groups based on constrained clustering and linear programming relaxation and fine-tunes the results within an integrated iterative approach. Experimental results on both synthetic and real-world data sets show that IMHM introduces less information loss for a given privacy parameter, and can be adopted for different real world applications.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. The pseudo-code of CBFS is very similar to MDAV in Fig. 1, but line 6 is removed and the main loop is iterated until less than k unassigned records remain (2k changes to k in line 7 of Fig. 1).

  2. The terms “merge” and “noMerge” are used in the DBA paper.

  3. Referred to as “Lloyd quantization design algorithm” in the original paper.

  4. Referred to as “cell” in the original paper.

  5. Number of Pre-partitioned Blocks (NPB).

  6. We have also used a semi-random initialization, which is discussed in Sect. 5.4.

  7. Here, the index of each centroid is its implicit number.

  8. A centroid is called unassigned, if none of its members are assigned yet. Similarly, a remained record is a record not in the Path.

  9. We assume, NPB=1, i.e., |B|=|T|=n.

  10. Please note that another procedure reorders records in a path for ImprovedMHM algorithm (line 19 in Fig. 2), so the term nlogn does not make sense here.

  11. Please see the Appendix.

  12. For simplicity, we removed the time complexity related to the dimension, d.

  13. In the referred paper, number of clusters (c) and records in a cluster (n i ) was chosen randomly.

References

  1. Sweeney L (2002) K-anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl-Based Syst 10(5):557–570

    Article  MathSciNet  MATH  Google Scholar 

  2. Machanavajjhala A et al (2007) L-diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov Data 1(1):3

    Article  Google Scholar 

  3. Ninghui L, Tiancheng L, Venkatasubramanian S (2007) T-closeness: privacy beyond k-anonymity and l-diversity. In: ICDE

    Google Scholar 

  4. Hong TP et al (2012) Using TF-IDF to hide sensitive itemsets. Appl Intell 1–9

  5. Yin Y et al (2011) Privacy-preserving data mining. In: Data mining. Springer, Berlin

    Chapter  Google Scholar 

  6. Domingo-Ferrer J, Solanas A, Martinez-Balleste A (2006) Privacy in statistical databases: k-anonymity through microaggregation. In: Proceedings of IEEE granular computing, pp 774–777

    Google Scholar 

  7. Oganian A, Karr AF (2011) Masking methods that preserve positivity constraints in microdata. J Stat Plan Inference 141(1):31–41

    Article  MathSciNet  MATH  Google Scholar 

  8. Domingo-Ferrer J, Torra V (2001) Disclosure control methods and information loss for microdata. In: Doyle P, Lane JI, Theeuwes JJM, Zayatz L (eds) Confidentiality, disclosure, and data access: theory and practical applications for statistical agencies, pp 93–112

    Google Scholar 

  9. Xu C et al (2012) Efficient fuzzy ranking queries in uncertain databases. Appl Artif Intell 37(1):47–59

    Article  Google Scholar 

  10. Wong RC-W et al (2006) (α,k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 754–759

    Chapter  Google Scholar 

  11. Sun X, Li M, Wang H (2011) A family of enhanced-diversity models for privacy preserving data publishing. Future Gener Comput Syst 27(3):348–356

    Article  MATH  Google Scholar 

  12. Sun X, Sun L, Wang H (2011) Extended k-anonymity models against sensitive attribute disclosure. Comput Commun 34(4):526–535

    Article  Google Scholar 

  13. Defays D, Nanopoulos P (1993) Panels of enterprises and confidentiality: the small aggregates method

  14. Domingo-Ferrer J, Mateo-Sanz JM (2002) Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans Knowl Data Eng 14(1):189–201

    Article  Google Scholar 

  15. Domingo-Ferrer J, Sramka M Disclosure control by computer scientists: an overview and an application of microaggregation to mobility data anonymization

  16. Rebollo-Monedero D, Forné J, Soriano M (2011) An algorithm for k-anonymous microaggregation and clustering inspired by the design of distortion-optimized quantizers. Data Knowl Eng

  17. Domingo-Ferrer J, Torra V (2005) Ordinal continuous and heterogeneous k-anonymity through microaggregation. Data Min Knowl Discov 11(2):195–212

    Article  MathSciNet  Google Scholar 

  18. Torra V (2004) In: Domingo-Ferrer J, Torra V (eds) Microaggregation for categorical variables: a median based approach, privacy in statistical databases. Springer, Berlin, pp 162–174

    Chapter  Google Scholar 

  19. Hansen SL, Mukherjee S (2003) A polynomial algorithm for optimal univariate microaggregation. IEEE Trans Knowl Data Eng 15(4):1043–1044

    Article  Google Scholar 

  20. Solanas A, Sebé F, Domingo-Ferrer J (2008) Micro-aggregation-based heuristics for p-sensitive k-anonymity: one step beyond. In: Proceedings of the 2008 international workshop on privacy and anonymity in information society. ACM, New York

    Google Scholar 

  21. Domingo-Ferrer J, Sebé F, Solanas A (2008) An anonymity model achievable via microaggregation. In: Secure data management, pp 209–218

    Chapter  Google Scholar 

  22. Jian-min H, Ting-ting C, Hui-qun Y (2008) An improved V-MDAV algorithm for l-diversity. In: 2008 international symposiums on information processing (ISIP). IEEE, New York

    Google Scholar 

  23. Skinner CJ et al (1994) Disclosure control for census microdata. J Off Stat 10(1):31–51

    Google Scholar 

  24. Nin J, Herranz J, Torra V (2008) On the disclosure risk of multivariate microaggregation. Data Knowl Eng 67(3):399–412

    Article  Google Scholar 

  25. Yancey W, Winkler W, Creecy R (2002) Disclosure risk assessment in perturbative microdata protection. In: Inference control in statistical databases, pp 49–60

    Google Scholar 

  26. Chang C-C, Li Y-C, Huang W-H (2007) TFRP: an efficient microaggregation algorithm for statistical disclosure control. J Syst Softw 80(11):1866–1878

    Article  Google Scholar 

  27. Domingo-Ferrer J, González-Nicolás Ú (2010) Hybrid microdata using microaggregation. Inf Sci 180(15):2834–2844

    Article  Google Scholar 

  28. Sun X et al (2012) An approximate microaggregation approach for microdata protection. Expert Syst Appl 39(2):2211–2219

    Article  Google Scholar 

  29. Crises G (2004) Microaggregation for privacy protection in statistical databases. Rovira i Virgili, Univ. Tarragona, Spain, Tech. Rep. CRIREP-04-005

  30. Hundepool A et al (2006) CENEX SDC handbook on statistical disclosure control, version 1.01

    Google Scholar 

  31. Oganian A, Domingo-Ferrer J (2001) On the complexity of optimal microaggregation for statistical disclosure control. Stat J United Nations Econ Comission Eur 18(4):345–354

    Google Scholar 

  32. Defays D, Nanopoulos P (1993) Panels of enterprises and confidentiality: the small aggregates method. In: Proceedings of the 1992 symposium on design and analysis of longitudinal surveys, Ottawa, Canada

    Google Scholar 

  33. Mateo Sanz JM, Domingo i Ferrer J (1998) A comparative study of microaggregation methods. Questiió 22(3):511–526

    MATH  Google Scholar 

  34. Domingo-Ferrer J et al (2006) Efficient multivariate data-oriented microaggregation. VLDB J 15(4):355–369

    Article  Google Scholar 

  35. Heaton B (2012) New record ordering heuristics for multivariate microaggregation. Nova Southeastern University

  36. Laszlo M, Mukherjee S (2005) Minimum spanning tree partitioning algorithm for microaggregation. IEEE Trans Knowl Data Eng 17(7):902–911

    Article  Google Scholar 

  37. Solanas A, Martinez-Balleste A, Domingo-Ferrer J (2006) V-MDAV: a multivariate microaggregation with variable group size. In: COMPSTAT’2006

    Google Scholar 

  38. Chang CC, Li YC, Huang WH (2007) TFRP: an efficient microaggregation algorithm for statistical disclosure control. J Syst Softw 80(11):1866–1878

    Article  Google Scholar 

  39. Solanas A et al (2006) Multivariate microaggregation based genetic algorithms. In: 2006 3rd international IEEE conference on intelligent systems

    Google Scholar 

  40. Lin JL et al (2010) Density-based microaggregation for statistical disclosure control. Expert Syst Appl 37(4):3256–3263

    Article  Google Scholar 

  41. Rebollo-Monedero D, Forné J, Soriano M (2011) An algorithm for k-anonymous microaggregation and clustering inspired by the design of distortion-optimized quantizers. Data Knowl Eng 70(10):892–921

    Article  Google Scholar 

  42. Panagiotakis C, Tziritas G (2011) Successive group selection for microaggregation. IEEE Trans Knowl Data Eng. http://www.computer.org/csdl/trans/tk/preprint/ttk2011990169-abs.html

  43. Li Y et al (2002) A privacy-enhanced microaggregation method. Foundations of Information and Knowledge Systems, 237–250

  44. Solanas A (2008) In: Yang A, Shan Y, Bui L (eds) Success in evolutionary computation. Springer, Berlin, pp 215–237

    Chapter  Google Scholar 

  45. Fayyoumi E, Oommen BJ (2010) A survey on statistical disclosure control and micro-aggregation techniques for secure statistical databases. Softw Pract Exp 40(12):1161–1188

    Article  Google Scholar 

  46. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations, California, USA

  47. Cormen TH et al (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge, p 850

    MATH  Google Scholar 

  48. Ghouila-Houri A (1962) Caractérisation des matrices totalement unimodulaires. C R Math Acad Sci 254:1192–1194

    MathSciNet  MATH  Google Scholar 

  49. Korte BH, Vygen J (2006) Combinatorial optimization: theory and algorithms. Algorithms and combinatorics, vol 21. Springer, Berlin

    Google Scholar 

  50. Wong RCW et al. (2007) Minimality attack in privacy preserving data publishing. In: Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment

  51. Mosek AS (2010) The MOSEK optimization software. Online at http://www.mosek.com

  52. Brand R, Domingo-Ferrer J, Mateo-Sanz J (2002) Reference data sets to test and compare SDC methods for protection of numerical microdata. European Project IST-2000-25069 CASC. http://neon.vb.cbs.nl/casc

  53. UCI KDD archive. Available from: http://kdd.ics.uci.edu/

  54. S, H. and B. SD (1999) The UCI KDD archive

  55. Cormode G et al (2010) Minimizing minimality and maximizing utility: analyzing method-based attacks on anonymized data. Proc VLDB Endow 3(1–2):1045–1056

    Google Scholar 

Download references

Acknowledgement

This research is partially supported by ITRC (Iran Telecommunication Research Center) under contract No. 12200/500. We would like to thank the editor and anonymous reviewers for their detailed comments that help improve the paper.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Saeed Jalili.

Appendix: Proofs related to the ImprovedMHM function (Fig. 4)

Appendix: Proofs related to the ImprovedMHM function (Fig. 4)

Assume a group of records G with n members {X 1,X 2,…,X n } in a d dimensional space. Let \(C_{1}^{n}[l]= \sum_{j\in \{1,2,\ldots ,n\}}X_{j}[l]/n\) represents the lth dimension (1≤ld) of the centroid of G, where X j [l] denotes the lth dimension of X j . Based on Eq. (1), it is easy to verify that the SSE of the group can be computed by SSE(G)=SSE(G[1])+SSE(G[2])+⋯+SSE(G[d]), where G[l]={X 1[l],X 2[l],…,X n [l]}.

Lemma 1

If a new record X n+1 is added to a group G 1, the SSE of the new group G 2=G 1∪{X n+1} denoted by SSE 2 can be calculated based on the calculated SSE before the addition (SSE 1) in O(1).

Proof

Let ΔSSE=SSE 2SSE 1=∑1≤ld SSE 2[l]−SSE 1[l]. Based on Eq. (1), we can expand the ΔSSE[l]:

(3)

The new centroid can be computed simply:

$$ C_{1}^{n+1}[l]={\bigl(n.C_{1}^{n}[l]+X_{n+1}[l] \bigr)}/{(n+1)}. $$
(4)

So Eq. (3) reduces to:

(5)

These two updates require O(1) computations. □

Equations (4) and (5) are used in the ImprovedMHM function (lines 25–26 of Fig. 4) to update the SSE and centroid of a new group.

Lemma 2

If a record X n+1 replaces a member in the group G, say X 1, the SSE of the new group can be calculated in O(1).

Proof

We have \(\mathit{SSE}_{2}[l]=\sum_{2\leq j\leq n+1}(X_{j}[l]-C_{2}^{n+1}[l])^{2}\). Let Δ[l]=X n+1[l]−X 1[l], so \(C_{2}^{n+1}[l]=C_{1}^{n}[l]+\Delta [l]/n\) and the SSE can be updated incrementally:

(6)

Equation (6) is used in line 19 of Fig. 4, and requires O(1) computations. □

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mortazavi, R., Jalili, S. & Gohargazi, H. Multivariate microaggregation by iterative optimization. Appl Intell 39, 529–544 (2013). https://doi.org/10.1007/s10489-013-0431-y

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10489-013-0431-y

Keywords

Navigation