1 Introduction

In the last decade the amount of data available globally has increased rapidly, reaching 79 zettabytes in 2021, with projections suggesting it will more than double by 2025.Footnote 1 At least two percent of this data will be stored and potentially made available online. Responsible data sharing policies therefore play a critical role in protecting individuals against unlawful processing and disclosure of personal data, since mishandled data can lead to security breaches and the unauthorized disclosure of sensitive or personal information.

The General Data Protection Regulation (GDPR) is an important milestone in reducing breaches and fraud involving personal data. The GDPR defines seven principles for the processing of personal data: lawfulness, fairness and transparency; purpose limitation; data minimisation; accuracy; storage limitation; integrity and confidentiality; and accountability. To comply with the integrity and confidentiality principle, companies commonly anonymize data.

According to EU law, anonymous data is "information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable" [1].

It has been proven that anonymizing datasets by merely suppressing identifiers is not enough to preserve privacy [2]. For example, a few years ago, a popular movie-streaming service released a dataset with movie ratings from over 50,000 users. Researchers later demonstrated that users could be re-identified by combining this data with publicly available information from IMDb.

k-anonymization [3,4,5] is a widely used privacy model designed to protect personal data against linkage attacks, which attempt to extract sensitive information by combining the data with external background information. In k-anonymization, each individual is indistinguishable from at least k-1 others in the dataset, reducing the probability of de-anonymization to 1/k. This model ensures that personal information remains difficult to link back to specific individuals.

Homogeneous anonymization requires all records within the same equivalence class (a group of at least k indistinguishable individuals) to have identical quasi-identifiers. To illustrate this, Table 2 shows a k-anonymized (with k=3) version of Table 1, with three equivalence classes. As can be seen, all quasi-identifiers remain the same within the same group. We note that the anonymized version of the dataset must also randomly shuffle the order of the tuples to avoid trivial linkage attacks; we skip this step for illustrative purposes.

In contrast, heterogeneous anonymization allows variations within the equivalence class, maintaining k-anonymity without requiring identical quasi-identifiers. This approach preserves more data utility, making the anonymized data more useful for analysis while still providing strong privacy protection. Table 3 outlines a heterogeneous k-anonymous version of Table 1. As can be seen in this example, the anonymized dataset allows non-reciprocal generalizations, e.g., Tuple 1 generalizes Tuple 2, but not the other way around. It is worth noting that homogeneous anonymization is a special case of heterogeneous anonymization. Therefore, well-optimized heterogeneous solutions provide at least the same utility while maintaining the same privacy properties.

Ideally, one would like to achieve maximum privacy without losing information. However, there is an inherent trade-off between decreasing the probability of re-identification (by increasing the value of k) and the ability to perform a meaningful statistical analysis of the anonymized dataset. Furthermore, the k-anonymization problem is known to be NP-hard, with exponential average time complexity for both heterogeneous and homogeneous k-anonymization.

Table 1 Sample dataset
Table 2 Homogeneous 3-Anonymization of Table 1

1.1 Contributions

In this paper, our contributions can be summarised as follows:

  • The design and implementation of a novel local search algorithm to tackle the heterogeneous k-anonymization problem. We equipped the algorithm with an incremental evaluation of the objective function capable of handling arbitrary information loss metrics.

  • The local search algorithm, designed as an anytime algorithm, allows users to flexibly trade off between computation time and solution quality. This means it can produce a valid solution even if interrupted, progressively improving the solution the longer it runs.

  • We evaluate the effectiveness of the proposed algorithm through extensive experiments on three well-known datasets from open repositories. Our local search framework consistently outperforms current state-of-the-art algorithms, demonstrating significant reductions in information loss. It shows strong scalability and efficiency across various datasets and k-values, providing performance improvements of up to 54% against k-members and 43% against l-greedy, which are considered leading algorithms, for large-size datasets with 20,000 tuples.

  • To the best of our knowledge, the proposed LS algorithm is the first solution approach to outperform the Hungarian-based solution on small-size instances; in particular, our LS algorithm reduces the amount of information loss by up to 4.7% w.r.t. the Hungarian-based solution on the IPUMS dataset with 1,000 tuples.

2 Related work

In recent years, the k-anonymization problem has garnered significant attention in the field of data privacy, leading to the development of various heuristic and algorithmic solutions. This section reviews key contributions to the k-anonymization literature, focusing on approaches that tackle the problem through clustering, heuristic optimization, and genetic algorithms. Each method presents unique strategies for balancing information loss and computational efficiency, highlighting the evolving landscape of k-anonymization techniques. Table 4 summarises these approaches.

Byun et al. [9] addresses the anonymization problem by framing it as a k-member clustering problem, where the objective is to find clusters, each containing at least k individuals, with minimal information loss. The authors propose the popular k-members greedy algorithm to solve the problem incrementally. The algorithm randomly selects the first member of the first cluster and then iteratively adds the individuals that minimize information loss until the cluster contains k members. Subsequent clusters start with the individual most dissimilar to the last used tuple. In contrast, Mauger et al. [8] proposes a hierarchical bottom-up clustering approach to tackle the problem. In this heuristic, each tuple starts in its own cluster, and pairs of clusters are merged by optimizing a given information loss metric until a given stopping criterion is met.

Liang and Samavi [10] proposed two heuristic solutions to tackle the homogeneous clustering problem. The first heuristic (Split & Carry) uses the divide-and-conquer paradigm to split the problem into subproblems, which are then individually optimized with Gurobi. The Split & Carry solution is limited to numeric attributes and, given the complexity of the k-anonymization problem, only works for relatively small k-values (k \(\le \) 5). The second heuristic solution (l-greedy) starts by sorting the dataset to incrementally create clusters with the best k elements. The authors show that l-greedy slightly outperforms the classical k-member method [9] for large datasets; however, the information loss is comparable to Doka et al. [7] for moderate-sized instances, and the performance is worse than the Hungarian algorithm for heterogeneous k-anonymization. Nevertheless, it is worth noting that l-greedy offers a better theoretical Big-O runtime complexity than [9] and [7].

ARX [17] is one of the most efficient libraries available for data anonymization. It provides efficient implementations of multiple privacy models, including k-anonymization, l-diversity, and (\(\epsilon , \delta \))-differential privacy. This library implements widely used algorithms for k-anonymization with pre-defined generalization hierarchies, such as breadth-first and genetic algorithms. While ARX outperforms prior algorithms, we remark that in this paper we focus on k-anonymization without pre-defined generalization hierarchies; moreover, an extensive evaluation suggests that specialised k-anonymization algorithms such as [6] outperform ARX. Of note, the definition of appropriate taxonomies is a tedious and non-trivial task, as discussed in [20].

In Doka et al. [7], the authors reformulate the heterogeneous k-anonymization problem as k iterative perfect matching problems and propose three heuristic solutions. The first heuristic incrementally selects edges in the graph with the minimum expected generalization cost; the second heuristic defines a linear ordering of the left nodes in the bipartite graph and systematically matches those nodes with the best possible edges; the last heuristic applies the Hungarian algorithm to find the expected optimal matching. Of note, the Hungarian-based algorithm finds optimal solutions for the perfect matching problem w.r.t. the weights of the edges. However, in our setting (k-anonymization), the weights of the edges change dynamically during the execution of the algorithm, e.g., the weights in the first iteration differ from those in the remaining ones. Therefore, the heuristic solution is unable to guarantee optimality. Notably, [10] independently confirmed that the Hungarian-based algorithm outperforms the current state-of-the-art heuristics for homogeneous k-anonymization. However, the algorithm requires a significant amount of CPU time to anonymize mid-size datasets.

Genetic algorithms (GAs) have been previously used to tackle the k-anonymization problem [11]. The GA defines the chromosomes as a sequence of binary numbers denoting some pre-defined generalizations of the QIs. Similarly to the ARX framework, the authors need to manually define tree-based generalization taxonomies for the attributes. Therefore, the GA encoding of the algorithm only works with predefined transformations or taxonomies. It remains unclear whether the algorithm can be extended beyond this constraint, as the representation of the solution is inherently bound to these taxonomies. In [12], the authors attempt to k-anonymize datasets with distributed data, while in [13], the authors propose the (p, l) diversity model to anonymize datasets with multiple sensitive attributes.

The seminal work of [9] remains a popular baseline reference for comparing newer algorithms such as those proposed in [11, 14,15,16, 18, 19, 21]. However, in this work, we refrain from considering these algorithms due to the impracticality of reproducing the proposed heuristics. Fundamental aspects, such as the definition of a gene or a population, remain ambiguous in the mentioned approaches.

Notably, the Adult dataset is the most popular dataset used to evaluate k-anonymization algorithms; 11 out of 15 papers use this dataset in their evaluations. The financial and healthcare datasets used in [3] and [12] come from private sources. Onesimu et al. [19] suggests that their healthcare data uses only three attributes. To the best of our knowledge, the IPUMS dataset is the largest publicly available dataset used so far, with hundreds of thousands of records.

Further research in privacy-preserving techniques is examining the application of k-anonymity principles in various fields, such as machine learning [22], healthcare [20], and cloud services [23]. Beyond k-anonymity, other privacy models have been developed to offer alternative protection mechanisms [24,25,26,27].

Fig. 1 Bipartite anonymization graphs

3 Background

In this paper, we assume that a dataset is a structured collection of information in table form, with a set of tuples (rows) and a set of attributes (columns) that can be divided into three main types, i.e., identifiers, quasi-identifiers (QIs), and personal or sensitive data. An identifier is an attribute (or a set of attributes) that uniquely identifies an individual; quasi-identifiers (or indirect identifiers) are a set of attributes that, in combination, might unambiguously identify an individual, even though individual QIs do not necessarily qualify as identifiers. Finally, the set of attributes representing sensitive data provides valuable information that must remain intact for statistical purposes.

Let us formally describe the k-anonymization problem as a k-regular bipartite graph G=(Q, Q', E), where Q denotes the set of original tuples, Q' denotes the set of anonymized tuples, and E denotes the set of edges from Q to Q', with an edge (\(q_i\), \(q'_j\)) \(\in \) E representing a generalization match between an original and an anonymized tuple. A k-regular graph is a graph where both the in-degree and out-degree are equal to k for all nodes.

Figure 1(a) shows a conventional homogeneous solution (Table 2), i.e., tuples are always grouped together to form equivalence classes, so that all individuals within the same group have identical generalizations. Alternatively, Fig. 1(b) shows a heterogeneous solution without requiring reciprocity (Table 3). Furthermore, as can be observed, both solutions satisfy the k-regular property (with k=3), as all nodes (left and right) in the graph have 3 incoming and outgoing edges. It is worth noticing that Doka et al. [7] demonstrates that k-regularity is enough to ensure k-anonymization for non-deterministic solutions (Table 4).
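To make the graph formulation concrete, the following minimal Python sketch (the helper name and the edge representation are ours, purely for illustration) stores a solution as a set of (i, j) index pairs from Q to Q' and checks the k-regular property described above.

```python
from collections import Counter

def is_k_regular(edges, n, k):
    """Check that every original tuple q_i and every anonymized tuple q'_j
    has exactly k incident edges (out-degree and in-degree equal to k)."""
    out_deg = Counter(i for i, _ in edges)
    in_deg = Counter(j for _, j in edges)
    return (all(out_deg[i] == k for i in range(n))
            and all(in_deg[j] == k for j in range(n)))

# Homogeneous 3-anonymization of a toy 3-tuple dataset: a single equivalence
# class in which every original tuple generalizes every anonymized tuple.
edges = {(i, j) for i in range(3) for j in range(3)}
print(is_k_regular(edges, n=3, k=3))  # True
```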

Table 3 Heterogeneous 3-Anonymization of Table 1
Table 4 Related work on k-anonymization

Heterogeneous and homogeneous k-anonymization offer the same privacy guarantees against random attacks. However, the homogeneous model is stronger against sustained attacks by adversaries with prior knowledge. In general, the security of the traditional homogeneous k-anonymization framework drops to \(k-c\) if the adversary is aware that c individuals belong to the same cluster in the anonymized dataset. In contrast, the security of the heterogeneous framework drops to \(k-c-\phi (k)\) if the adversary is aware of c true matches in the anonymized dataset, for a function \(\phi \) that ranges from 0 to \(\frac{34\cdot c^2}{k}\). To the best of our knowledge, there are no known additional weaknesses of heterogeneous k-anonymization. We refer the reader to Doka et al. [7] for a complete analysis of the theoretical privacy guarantees of heterogeneous k-anonymization.

3.1 Information loss

Preserving the privacy of a given dataset via k-anonymization leads to a certain degree of information loss. The goal is to find a balance that ensures protection in the anonymized data without compromising its quality and usability.

Let qn (resp. qc) denote the set of numerical (resp. categorical) QIs in a given dataset. Furthermore, we use \(\rho _{i, j}\) and \(\tau _{i,j}\) (resp. \(\rho '_{i,j}\) and \(\tau '_{i,j}\)) to denote the QI-values (resp. generalization domains) of QI j of the i\(^{th}\) individual in the dataset. Thus, our working example in Tables 1 and 2 includes two numerical QIs, i.e., \(qn= \{\)age, salary\(\}\), and a single categorical QI, i.e., qc = \(\{\)city\(\}\); the age QI of the first individual is \(\rho _{1,age}\)=50, and its corresponding anonymized values are \(\rho '_{1, age}\) = [30-50] and \(\tau '_{1, city}\)= {Rome, Paris, Oslo}.

Let \(min_{i,j}\) and \(max_{i,j}\) denote the minimum and maximum QI-values of numerical QI j among the matches of tuple i in the anonymization graph. Let \(min^j\) and \(max^j\) denote the min. and max. QI-values of QI j in the entire dataset. Equation 1 defines the Normalized Certainty Penalty (NCP) metric for computing the information loss for numerical attributes.

The information loss incurred for the salary attribute of the first tuple in Table 2 is NCP(\(\rho '_{1, salary}\)) = (42-22)/(42-15) = 0.74. Similarly, the information loss incurred for the corresponding item in Table 3 is (34-22)/(42-15) = 0.44.

$$\begin{aligned} NCP_{num}(\rho '_{i,j}) = \frac{max_{i,j} - min_{i,j}}{max^{j} - min^{j}} \end{aligned}$$
(1)

Binary encoding is a popular approach for handling categorical QIs without ordinal relationships. Each category for a given QI, \(qc_i \in q_i\), gets a new dummy binary variable indicating whether or not the anonymized tuple maps to a particular value. Formally, let \(\tau ^{v}_{i,j}\) be a dummy variable denoting whether the categorical variable v is part of the generalization domain of \(\tau _{i,j}\). Equation 2 formally describes the information loss for categorical attributes.

In the context of our previous example in Table 2, we have seven dummy variables for the city QI. In particular, the dummy variables for the first tuple take the following values: \(\tau ^{Rome}_{1,City}\) = 1, \(\tau ^{Paris}_{1,City}\) = 1, \(\tau ^{Cali}_{1,City}\) = 1, \(\tau ^{Oslo}_{1,City}\) = 0, \(\tau ^{Nara}_{1,City}\) = 0, \(\tau ^{LA}_{1,City}\) = 0, \(\tau ^{York}_{1,City}\) = 0. The total information loss incurred for this particular tuple is NCP(\(\tau '_{1,City}\)) = (3-1)/(7-1) = 0.33. Similarly, the corresponding information loss for Table 3 is NCP(\(\tau '_{1,City}\)) = 1/6 \(\approx \) 0.17.

$$\begin{aligned} NCP_{cat}(\tau '_{i,j}) = \frac{\sum _{v \in \tau _{i,j}} \tau ^{v}_{i,j} - 1}{|\tau _{i,j}|-1} \end{aligned}$$
(2)

Furthermore, Equation 3 aggregates the information loss of a given anonymized tuple \(q'_i\) by computing the sum of the NCP values of its numerical and categorical attributes.

$$\begin{aligned} NCP(q'_i) = \sum _{j \in qn_{i}} NCP_{num}(\rho '_{i,j}) + \sum _{j \in qc_{i}} NCP_{cat}(\tau '_{i,j}) \end{aligned}$$
(3)

The Generalized Certainty Penalty (GCP) is a widely used function that measures the information loss in an anonymized dataset. The conventional goal of k-anonymization is to minimize the GCP function (Equation 4) for k-regular graphs. Let d and |Q| denote the number of QIs (attributes) and the number of individuals in the dataset, respectively. Notably, GCP values range from 0 to 1. No information loss (GCP = 0) means that the data remains fully informative and usable without any degradation. Conversely, a GCP value of 1 means total information loss, that is, all QI-values have been generalized to the point of being indistinguishable from one another, effectively removing all useful information from the dataset.

$$\begin{aligned} GCP(Q') = \frac{ \sum _{i \in Q'} NCP(q'_i)}{d \cdot |Q'|} \end{aligned}$$
(4)
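As a sanity check on Equations 1, 2, and 4, the short sketch below (with hypothetical helper names) recomputes the NCP values reported for the worked example and shows how per-tuple NCP values would be aggregated into a GCP score.

```python
def ncp_num(group_min, group_max, global_min, global_max):
    # Equation 1: spread of the generalization interval, normalized by the
    # attribute's spread over the whole dataset.
    return (group_max - group_min) / (global_max - global_min)

def ncp_cat(n_covered, n_categories):
    # Equation 2: categories covered by the generalization domain minus one,
    # normalized by the total number of categories minus one.
    return (n_covered - 1) / (n_categories - 1)

def gcp(per_tuple_ncp, n_qis):
    # Equation 4: average NCP per tuple and per quasi-identifier.
    return sum(per_tuple_ncp) / (n_qis * len(per_tuple_ncp))

# Salary of tuple 1 generalized to [22, 42], dataset-wide range [15, 42];
# city generalized to 3 of the 7 possible values (see Section 3.1).
print(round(ncp_num(22, 42, 15, 42), 2))     # 0.74
print(round(ncp_cat(3, 7), 2))               # 0.33
# Aggregating hypothetical per-tuple NCP values over a 3-QI dataset.
print(round(gcp([1.51, 0.90], n_qis=3), 2))  # 0.4
```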

Alternatively, [6] defines a metric to normalize the information loss of value generalization algorithms by calculating the distance between the original and the average anonymized values. Equations 5 and 6 describe the normalized Euclidean distance between the original and anonymized tuples for numerical and categorical attributes. Let us recall that \(\rho _{i,j}\) denotes the original value of Tuple i for QI j, while \(\rho '_{i,j}\) represents the average over all incoming edges in the anonymization graph for QI j; for instance, in our example in Table 2, \(\rho '_{1, age}\) = 40.3 ((50+30+41)/3). Furthermore, \(\sigma _j\) denotes the standard deviation of the j-th attribute and ensures that the distance between the original and anonymized values (i.e., \(\rho _{i,j}\) - \(\rho '_{i,j}\)) is appropriately scaled, especially when the QI-values have different variances.

For categorical attributes, let \(\tau _{i,j}\) and \(\tau '_{i,j}\) denote the original (resp. anonymized) value of QI j for Tuple i, with \(\tau '_{i,j}\) representing the mode of all incoming edges in the anonymization graph for QI j. As in the numerical case, we use the standard deviation to normalize the distance between the original and anonymized values.

$$\begin{aligned} distance_{num}(q_i, q'_i)= & \sqrt{ \sum _{j \in qn_{i}} \left( \frac{ \rho _{i,j}-\rho '_{i,j} }{\sigma _j} \right) ^2} \end{aligned}$$
(5)
$$\begin{aligned} distance_{cat}(q_i, q'_i)= & \sum _{j \in qc_{i}} \frac{\tau _{i,j} \ne \tau '_{i,j}}{\sigma _j} \end{aligned}$$
(6)

Equation 7 denotes the aggregated information loss value using the principles of Sanchez et al. in [6] for value generalisations with d attributes and \(|Q'|\) individuals. SIL helps to evaluate the amount of information loss; lower values indicate that the algorithm has preserved more of the original utility of the dataset, implying that the anonymized data is highly representative.

$$\begin{aligned} SIL(Q') = \frac{\sum _{i \in Q'} distance_{num}(q_i, q'_i) + distance_{cat}(q_i, q'_i) }{d \cdot |Q'|} \end{aligned}$$
(7)
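The sketch below illustrates Equations 5 and 7 for numerical attributes under assumed values (the standard deviation of 10 for age and the second per-tuple distance are purely illustrative); the categorical term of Equation 6 would be added per tuple in the same way.

```python
import math

def distance_num(orig, anon_avg, std):
    """Equation 5: normalized Euclidean distance between a tuple's original
    numerical QI-values and the averages of its generalization matches.
    `orig`, `anon_avg`, and `std` are parallel lists over the numerical QIs."""
    return math.sqrt(sum(((o - a) / s) ** 2
                         for o, a, s in zip(orig, anon_avg, std)))

def sil(per_tuple_distances, n_qis):
    # Equation 7: aggregate per-tuple distances, normalized by the number of
    # attributes and the number of tuples.
    return sum(per_tuple_distances) / (n_qis * len(per_tuple_distances))

# The age of tuple 1 is 50 and its three matches average 40.3 (Section 3.1);
# an assumed standard deviation of 10 gives a scaled distance of 0.97.
print(round(distance_num([50.0], [40.3], [10.0]), 2))  # 0.97
print(round(sil([0.97, 1.20], n_qis=2), 2))            # 0.54
```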

In this paper, we limit our attention to the GCP and SIL metrics, as these two metrics have been independently evaluated by numerous authors. However, several alternative metrics have been proposed to measure the information loss of anonymized datasets; e.g., [8] describes alternative metrics, including the Manhattan distance and minimal distortion metrics. We remark that some of the proposed metrics can only be used in the context of homogeneous clustering-based anonymization.

4 Iterated local search

Iterated local search (ILS) is a popular meta-heuristic framework used to tackle combinatorial problems. Algorithm 1 outlines the main components of the ILS algorithm. Broadly speaking, the algorithm works in three phases: (1) calculating an initial solution or anonymized datasetFootnote 2; (2) improving the current solution through local search by performing small changes until a local minimum is reached; and (3) perturbing the incumbent solution (\(Q'\)) to escape difficult search regions. The acceptance criterion decides whether to replace \(\hat{\text {Q}}'^{*}\) with \(\hat{\text {Q}}'\).

Our algorithm requires three parameters: Q, the original dataset; the desired k-value; and the objective function (obj), which measures the information loss of the anonymized dataset. The local search algorithm uses a move operator to explore nearby solutions, using the objective function to guide the transition from one solution to another until a local minimum is reached. The perturbation phase performs a given number of random moves in order to diversify the search.

Algorithm 1 Iterated Local Search (Q, k, obj)
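Since Algorithm 1 is given only as a listing, the simplified Python skeleton below conveys its three-phase structure; the initial_solution, local_search, perturb, and accept callables are placeholders for the components described above, not the paper's actual implementation.

```python
import time

def iterated_local_search(Q, k, obj, initial_solution, local_search,
                          perturb, accept, time_limit=60.0):
    """Generic ILS skeleton: build an initial anonymization, repeatedly improve
    it with local search, and perturb the incumbent to escape local minima."""
    incumbent = local_search(initial_solution(Q, k), obj)  # phases 1 and 2
    best = incumbent
    start = time.time()
    while time.time() - start < time_limit:                # anytime behaviour
        candidate = local_search(perturb(incumbent), obj)  # phases 3 and 2
        if obj(candidate) < obj(best):                     # keep the best-so-far solution
            best = candidate
        if accept(candidate, incumbent, obj):              # acceptance criterion
            incumbent = candidate
    return best
```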

Algorithm 2 depicts the pseudocode of the local search (LS) algorithm. In this paper, we use the 2-opt operator to explore the neighbours of a given solution. The operator replaces two edges \((q_a, q'_b)\) and \((q_c, q'_d)\) of the current solution with \((q_a, q'_d)\) and \((q_c, q'_b)\). It is worth noting that only swaps of tuple-disjoint edges maintain valid solutions, i.e., all nodes in \(\{q_a, q_c, q'_b, q'_d\}\) must be distinct; otherwise the resulting graph will not be k-regular.
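A minimal sketch of the 2-opt move on the edge-set representation used earlier (illustrative names). In addition to the tuple-disjointness check, the sketch rejects swaps whose new edges already exist in the solution, which corresponds to the duplicate-edge case discussed for Fig. 2 below.

```python
def tuple_disjoint(e1, e2):
    """Edges (a, b) and (c, d) may only be swapped if they share no node on
    either side of the bipartite anonymization graph."""
    (a, b), (c, d) = e1, e2
    return a != c and b != d

def apply_2opt(edges, e1, e2):
    """Replace edges (a, b) and (c, d) with (a, d) and (c, b) in place."""
    (a, b), (c, d) = e1, e2
    if not tuple_disjoint(e1, e2):
        raise ValueError("edges are not tuple disjoint")
    if (a, d) in edges or (c, b) in edges:
        raise ValueError("swap would duplicate an existing edge")
    edges -= {e1, e2}
    edges |= {(a, d), (c, b)}
    return edges

# A 2-regular toy solution on four tuples; the swap keeps it 2-regular.
solution = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 0)}
apply_2opt(solution, (0, 1), (2, 3))  # now contains (0, 3) and (2, 1) instead
```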

Figure 2 illustrates the behaviour of the operator, with gray edges denoting the generalizations of the current solution (Fig. 2(a)). Figure 2(b) depicts the set of feasible swaps (dotted lines) involving a given edge \((q_1, q'_2)\) (black edge) in the current solution, i.e., \((q_2, q'_4)\), \((q_3, q'_4)\), \((q_4, q'_3)\). As pointed out above, swapping other edges with \((q_1, q'_2)\) results in invalid solutions; e.g., applying the operator to \((q_1, q'_2)\) and \((q_3, q'_1)\) generates a non k-regular graph, as the resulting swap removes \((q_1, q'_2)\) and duplicates \((q_1, q'_1)\).

Fig. 2 2-opt operator

Figure 2(c) shows the state of the solution after applying the operator to the edges \((q_1, q'_2)\) and \((q_3, q'_4)\). Specifically, the operator replaces \((q_1, q'_2)\) and \((q_3, q'_4)\) with \((q_1, q'_4)\) and \((q_3, q'_2)\).

Algorithm 2 Local Search (Q, Q', Obj)

Algorithm 2 depicts the pseudocode of the LS algorithm for arbitrary objective functions (or information loss metrics); in this paper, we focus on the GCP and SIL metrics. The objective function (Obj) computes the information loss of the anonymized dataset (e.g., Line 1) as well as the information loss of individual tuples (e.g., Line 8). Lines 3-26 form the core of the algorithm: it selects, uniformly at random, a node \(q'_b\) from the current list of unexplored nodes, and then selects an improving neighbour by applying the move operator (Line 10). Lines 11-13 keep track of the best move so far, and Lines 20-25 update the incumbent solution as prescribed by the algorithm. The first-improvement option reduces the time complexity of evaluating all neighbours of a given solution by moving to the first improving neighbour. Furthermore, we highlight the importance of the random component in reducing the risk of re-identification when the adversary is aware of the anonymization algorithm.

Instead of verifying that a local minimum has been reached by exhaustively checking all moves after each iteration, we maintain a list of nodes that can potentially improve the solution. Line 3 initialises the list with all nodes \(q'_i \in Q'\), and the algorithm populates the list with improving nodes that might generate better solutions (Line 25). Furthermore, we remark that the list also helps to reduce the number of useless moves by discarding nodes that will not improve the solution.
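The simplified skeleton below (illustrative names; the obj_delta and neighbours callables stand in for the move evaluation and neighbourhood enumeration of Algorithm 2) shows the first-improvement variant together with the list of unexplored nodes described above.

```python
import random

def local_search(edges, obj_delta, neighbours, seed=0):
    """First-improvement local search skeleton. `obj_delta(edges, e1, e2)`
    returns the net gain of swapping edges e1 and e2 (positive = improvement);
    `neighbours(edges, node)` yields tuple-disjoint edge pairs involving `node`."""
    rng = random.Random(seed)
    todo = {j for _, j in edges}          # anonymized nodes still worth exploring
    while todo:
        node = rng.choice(sorted(todo))   # pick an unexplored node at random
        todo.discard(node)
        for e1, e2 in neighbours(edges, node):
            if obj_delta(edges, e1, e2) > 0:   # first improving move found
                (a, b), (c, d) = e1, e2
                edges -= {e1, e2}
                edges |= {(a, d), (c, b)}
                todo |= {b, d}            # touched nodes may improve further
                break
    return edges
```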

Fig. 3 Move operator

Figure 3 outlines the behaviour of the move operator. The algorithm exhaustively explores all tuple disjoint edges in the current solution. Let \(\varDelta ^{-}_{t_1}\) and \(\varDelta ^{-}_{t_2}\) (resp. \(\varDelta ^{+}_{t_1}\) and \(\varDelta ^{+}_{t_2}\)) denote the information loss incurred after deleting (resp. adding) a given anonymization edge from the current solution. Thus, \(\delta \) represents the net gain of the move; positive values denote an improvement in the solution. Without loss of generality, Obj2Add computes the information loss (NCP or SIL) incurred by adding a given edge to the current solution M.
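A sketch of the net-gain computation, assuming hypothetical obj2delete and obj2add helpers that return the information loss removed or added by a single edge (the latter playing the role of Obj2Add above):

```python
def move_gain(obj2delete, obj2add, solution, e1, e2):
    """Net gain (delta) of replacing edges (a, b) and (c, d) with (a, d) and (c, b).
    Positive values indicate that the swap reduces the total information loss."""
    (a, b), (c, d) = e1, e2
    removed = obj2delete(solution, a, b) + obj2delete(solution, c, d)
    added = obj2add(solution, a, d) + obj2add(solution, c, b)
    return removed - added
```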

5 Operations and complexities

Our proposed framework belongs to the anytime family of algorithms, a type of algorithm designed to find a solution to a problem in a short amount of time, while also being able to incrementally improve the solution as more computational resources become available. In our particular case, the algorithm maintains an incumbent feasible solution (i.e., a k-anonymized dataset) and improves it by moving to neighbour feasible solutions.

5.1 Numerical QIs

For an efficient implementation, it is necessary to incrementally maintain additional data structures for each QI. For numerical attributes, we maintain four variables: \(min_{a,j}\), \(min^{2}_{a,j}\), \(max_{a,j}\), \(max^{2}_{a,j}\), holding the min. and max. (resp. second min. and max.) values of a numerical QI for a given tuple \(q'_j \in Q'\). For instance, for the QI age of the first tuple in Table 3, we maintain the minimum value (\(min_{age,1}\)=30), the second minimum value (\(min^{2}_{age,1}\)=41), the maximum value (\(max_{age,1}\)=50), and the second maximum value (\(max^{2}_{age,1}\)=41). These variables help us to compute, in constant time, the impact of removing any edge in the anonymization graph.

\(min_{a,j}\) and \(max_{a,j}\) allow the calculation of the information loss for a given QI in constant time. Alternatively, \(min^{2}_{a,j}\) and \(max^{2}_{a,j}\) allow us to calculate, in constant time, the change incurred by removing an edge \((q_b, q'_a)\) from the current incumbent solution. To this end, we replace \(min_{a,j}\) (resp. \(max_{a,j}\)) with \(min^{2}_{a,j}\) (resp. \(max^{2}_{a,j}\)) if \(\rho _{b,j}\) is acting as the minimum (resp. maximum) in the anonymization graph for \(\rho '_{b,j}\). Computing the information loss incurred by adding a new edge \((q_b, q'_a)\) involves checking whether the new edge modifies \(min_{a,j}\) and \(max_{a,j}\) for all \((q_u, q'_j) \in E\) and \(a \in qn_j\).
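A minimal sketch of this bookkeeping for one numerical QI of one anonymized tuple (an illustrative class, not the paper's C++ data structure); ties between equal values are ignored for brevity.

```python
class MinMaxTracker:
    """Tracks the two smallest and two largest incoming QI-values of one
    anonymized tuple, so the spread after deleting one edge is known in O(1)."""
    def __init__(self, values):
        s = sorted(values)
        self.min1, self.min2 = s[0], s[1]
        self.max1, self.max2 = s[-1], s[-2]

    def add(self, v):
        # Constant-time update when a new incoming edge contributes value v.
        if v < self.min1:
            self.min1, self.min2 = v, self.min1
        elif v < self.min2:
            self.min2 = v
        if v > self.max1:
            self.max1, self.max2 = v, self.max1
        elif v > self.max2:
            self.max2 = v

    def spread_without(self, v):
        # Spread (max - min) of the remaining values if the edge carrying v
        # were removed; the second-best values stand in when v is an extremum.
        lo = self.min2 if v == self.min1 else self.min1
        hi = self.max2 if v == self.max1 else self.max1
        return hi - lo

# Ages generalizing the first anonymized tuple of the running example.
t = MinMaxTracker([50, 30, 41])
print(t.spread_without(30))  # 9: only 41 and 50 would remain
```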

5.2 Categorical QIs

We can also calculate the information loss for the dummy variables of a categorical QI in constant time by adding, for each dummy variable \(\tau ^{v}_{i,j}\), an extra counter \(\tau ^{v'}_{i,j}\) holding the number of incoming edges generalising to that value (Equation 8).

For instance, for the second tuple in our working example (Table 3), we have \(\tau ^{Paris'}_{2, city}\)=1, as (\(q_2\), \(q'_1\)) is the only generalization match for the dummy variable Paris. However, for the ninth tuple in the same example we have \(\tau ^{Rome'}_{9, city} = 2\), as there are two generalization matches, i.e., (\(q_9\), \(q'_9\)) and (\(q_1\), \(q'_9\)). We calculate the information loss incurred by removing an edge \((q_b, q'_a)\) from a given solution by decrementing \(NCP(\tau _{b,j})\) whenever \(q_b\) is the only supporting edge of the dummy variable, i.e., \(\tau ^{v'}_{b,j}\)=1 before removing the edge. For instance, \(q_2\) is the only supporting edge for \(\tau ^{Paris'}_{2, city}\), as it denotes the only generalization match for the second anonymized tuple. Similarly, we increase \(NCP(\tau _{b,j})\) when attempting to add a new edge \((q_d, q'_a)\) and \(q_d\) becomes the only supporting value for a categorical attribute \(\tau ^{v}_{b,j}\), i.e., \(\tau ^{v'}_{b,j}\) is 0 before adding the edge.

$$\begin{aligned} \forall _{q'_i \in Q'}&\; \forall _{j \in \; qc_i}\; \forall _{v} \nonumber \\&\tau ^{v'}_{i,j} = \sum _{(q_u, q'_i) \in E} \tau _{u,j} == v \end{aligned}$$
(8)
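A sketch of the per-category counters of Equation 8 for one categorical QI of one anonymized tuple (hypothetical names); the number of categories with a positive counter gives the numerator of Equation 2, so add and remove updates avoid rescanning the incoming edges.

```python
from collections import Counter

class CategoryTracker:
    """Counts, per category value, how many incoming edges generalize to it."""
    def __init__(self, values):
        self.support = Counter(values)   # e.g. {'Rome': 2, 'Paris': 1}

    def covered(self):
        # Size of the generalization domain: categories with at least one edge.
        return len(self.support)

    def add(self, value):
        # Adding an edge for an unsupported category grows the domain by one.
        self.support[value] += 1

    def remove(self, value):
        # Removing the last supporting edge of a category shrinks the domain.
        self.support[value] -= 1
        if self.support[value] == 0:
            del self.support[value]

t = CategoryTracker(['Rome', 'Rome', 'Paris'])
print(t.covered())   # 2 -> NCP numerator (Equation 2) is 2 - 1 = 1
t.remove('Paris')
print(t.covered())   # 1 -> NCP numerator drops to 0
```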
Table 5 Time complexities with and without incremental calculations for the GCP and SIL metrics

5.3 Complexities

Table 5 summarises the time complexities of the main operations. Incremental operations improve the time complexity of calculating the information loss value for the GCP and SIL metrics. This approach eliminates the need for the operator to scan the incoming edges of an anonymized node \(q'_i \in Q'\) to calculate the NCP values, as the algorithm precomputes the critical values, e.g., min. and max. QI-values. We note that although the non-incremental restore and swap operations can be performed in constant time, the Delete and NCP2Add (resp. SIL2Add) operations still need to be performed for \(n^2\) neighbour solutions.

To efficiently calculate the GCP value, we incrementally maintain the following NCP values: max (i.e., objective), second and third max. (\(max'\) and \(max''\)). This ensures the same complexities as for the GCP objective.

For the SIL objective, we maintain the summation (\(\Upsilon _{a,i}\)) for each tuple i and QI a. For instance, for the QI age of the second individual in Table 3, we maintain \(\Upsilon _{age,2}\)=121 (30+50+41). This way, SIL2Add (resp. Delete) appropriately increases (resp. decreases) \(\Upsilon _{age,2}\).

We remark that the only difference in the runtime complexity for NCP and SIL lies in the Obj2Add operation. Unlike the NCP2Add operation, which needs to visit the k incoming edges of a given node, SIL2Add only needs to increase and decrease \(\Upsilon _{a,i}\) without visiting the incoming edges of the node.
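A minimal sketch of the running sum \(\Upsilon \) for one numerical QI of one anonymized tuple (illustrative names); in a k-regular graph every move removes and adds exactly one incoming edge, so the divisor k never changes.

```python
class RunningSum:
    """Running sum of the incoming QI-values of one anonymized tuple, so the
    average anonymized value used by SIL is available in O(1) after each move."""
    def __init__(self, values):
        self.total = sum(values)
        self.k = len(values)   # constant in a k-regular graph

    def add(self, v):
        self.total += v        # SIL2Add: a new incoming edge contributes v

    def remove(self, v):
        self.total -= v        # Delete: the edge carrying v is dropped

    def average(self):
        return self.total / self.k

# Ages generalizing the second anonymized tuple of the working example.
upsilon = RunningSum([30, 50, 41])
print(upsilon.total, round(upsilon.average(), 1))  # 121 40.3
```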

It is important to note that most authors overlook the computational cost of evaluating the information loss derived from anonymizing a given tuple. For instance, Liang and Samavi [10] indicate that the runtime complexity of the l-greedy algorithm is \(O(n^2)\) regardless of the choice of k; however, the authors ignore the complexity of computing the information loss incurred from anonymizing a given tuple. Furthermore, the authors' implementation of the algorithm shows that there is an overhead of at least \(O(k \cdot d)\) for each iteration of the algorithm, as it needs to constantly recalculate the min. and max. values of each cluster. The overhead might be even more significant for categorical attributes, as the algorithm needs to check the value of each dummy variable to compute the information loss.

5.4 Perturbation phase

The perturbation phase aims at diversifying the search to escape from difficult areas of the search space; this increases the chances of finding better solutions rather than getting stuck in local minima. In this paper, we perturb a given solution by performing n random swaps of two randomly selected tuple-disjoint edges.

5.5 Initial solution

As mentioned earlier, the ILS algorithm requires a valid initial solution (Line 1 - Algorithm 1). The algorithm can start from the output of any existing algorithm for homogeneous or heterogeneous k-anonymization. For heterogeneous k-anonymization heuristics, such as the Hungarian-based method, the output can be used directly without additional processing. For homogeneous algorithms, the following scenarios need to be considered:

  • for each cluster c with exactly k tuples: the solution can be used without additional processing, as the resulting graph is k-regular (a minimal conversion sketch follows this list);

  • for each cluster c with more than k tuples: we propose to iteratively and randomly remove an edge \((q_a, q'_b)\) and delete an augmenting path starting at \(q_a\) and ending in \(q'_b\). We perform |c| - k iterations until reaching a k-regular graph.
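For the first scenario, the conversion is immediate; the sketch below (an illustrative helper, assuming clusters are given as lists of tuple indices) builds the complete bipartite block for each cluster of exactly k tuples, which is k-regular by construction. Clusters with more than k tuples would additionally require the edge and augmenting-path removals described in the second item.

```python
def clusters_to_edges(clusters, k):
    """Turn a homogeneous clustering into a k-regular anonymization graph.
    Each cluster of exactly k tuple indices becomes a complete bipartite block:
    every original tuple in it generalizes every anonymized tuple in it."""
    edges = set()
    for cluster in clusters:
        if len(cluster) != k:
            raise NotImplementedError("larger clusters need the augmenting-path reduction")
        edges |= {(i, j) for i in cluster for j in cluster}
    return edges

# Two 3-member clusters over six tuples (a homogeneous 3-anonymization):
# 18 edges in total, and every node has in-degree and out-degree 3.
print(len(clusters_to_edges([[0, 1, 2], [3, 4, 5]], k=3)))  # 18
```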

6 Evaluation

In this paper, we consider the following three datasets:

  • IPUMS: similar to [10], we consider the IPUMS USA dataset, which contains hundreds of thousands of tuples of U.S. census data. We randomly selected up to 40,000 tuples, each containing the following four attributes: FAMUNIT, FAMSIZE, ELDCH, and YNGCH. This is the largest dataset used to evaluate k-anonymization algorithms [10].

  • Adult: a commonly used dataset from the Kaggle repository for predicting whether the income of a given person exceeds $50K per year. We randomly selected up to 10,000 tuples with all the 14 available attributes. This is the most popular dataset used to evaluate k-anonymization algorithms.

  • Youtube: this dataset contains Youtube video and channel metadata, including total views, likes/dislikes, comments/views, and other ratios, to analyze the statistical relation between videos and form a topic tree. We selected up to 10,000 tuples with all the 23 available attributes.

We evaluate the current state-of-the-art heuristic solutions for k-anonymization, including our C++ implementations of the k-member, l-greedy, and Hungarian algorithms, as well as the Split & Carry implementation from [10] with Gurobi. We ran our local search algorithm five times (each time with a different random seed), starting from the solutions produced by the k-member, l-greedy, and Hungarian algorithms, and report the average performance. These experiments ran on an Intel Core i5-8265U machine at 1.6 GHz with 8 GB of RAM running Ubuntu. Due to licensing limitations, we performed the Split & Carry experiments on a MacBook Pro (M1 - 2020) with 16 GB of RAM running macOS Big Sur. Notably, the MacBook Pro is significantly more powerful than the Intel machine used for the other experiments.

Our C++ l-greedy implementation is over 30 times faster than the authors' original Python implementation; for this reason, we focus on our own implementations. Figure 4 illustrates the impact of using our data structures on the l-greedy heuristic for our three reference datasets with the GCP objective function, k=5, and 5,000 and 10,000 tuples. Notably, our incremental computation helps to speed up the performance in all scenarios; e.g., for the IPUMS and Youtube datasets, with 4 and 23 attributes respectively, the algorithm reports an improvement of 20% and 44% (with 10,000 tuples). We exclude the runtime performance with 1,000 tuples, as our implementation solves the problem in less than 1 sec.

Fig. 4 l-greedy execution time with and without incremental calculations

6.1 Relative performance improvement

In this paper, we calculate the Relative Performance Improvement (RPI) of using our LS algorithm to improve the quality of the anonymized solution produced by the k-member, l-greedy, and Hungarian heuristics.

$$\begin{aligned} RPI = 100 \times \frac{\textit{Initial-heuristic} - \textit{LS}}{\textit{Initial-heuristic}} \end{aligned}$$
(9)

6.2 Empirical evaluation

We start our empirical evaluation with Table 6. In these experiments, we use a timeout of 100 seconds to solve each subproblem for the Split & Carry algorithm, and 300 secs. (resp. 1,200 secs.) for the remaining algorithms with 1,000 and 5,000 tuples (resp. 10,000 tuples). TO indicates that the algorithm is unable to find a feasible solution within the time limit. We remark that there is no global time limit for Split & Carry; a TO, in this case, indicates that Gurobi is unable to find feasible solutions for at least one subproblem.

Each cell of the table reports the GCP information loss of the reference heuristic (top) and the LS-improved solution (bottom). These results are consistent with the literature (see [7] and [10]), indicating that the Hungarian algorithm produces better solutions than l-greedy and k-members, albeit with a significant runtime overhead. Interestingly, Split & Carry slightly outperforms l-greedy and k-members in five out of the six instances for which it finds feasible solutions. However, LS + l-greedy and LS + k-members always outperform Split & Carry. The performance improvement can be seen in both the quality of the solutions and the runtime (with k=5). As pointed out in [10], the Split & Carry algorithm is unable to find feasible solutions within reasonable times for k>5.

These results are consistent with the literature, supporting the conclusion that the Hungarian-based algorithm performs better than k-members, l-greedy, and Split & Carry on small-size instances. Moreover, l-greedy outperforms k-members in 9 out of 14 experiments. To the best of our knowledge, our solution approach is the first algorithm to outperform the Hungarian algorithm on small-size instances and the remaining three algorithms on large-size instances. This indicates the robustness and efficiency of our approach across different instance sizes and highlights its potential as a better alternative in the context of GCP information loss minimization.

Table 6 Each cell reports the performance of the heuristic (top) and the improved LS solutions (bottom) with the IPUMS dataset and the GCP metric (x 100) – TO indicates that the algorithm reached a time out

Figure 5 shows the progress of the information loss reduction of our anytime LS algorithm for the IPUMS dataset with 1,000 tuples and k=5. Unlike k-members and l-greedy, our LS algorithm provides a valid k-anonymized solution at any time during its execution. As can be seen, the algorithm quickly improves the initial solution (computed with state-of-the-art algorithms) and then continues to improve it gradually. In this work, we use a timeout as the main stopping criterion; however, our algorithm usually finds the best solution before the time limit. In this particular example, Hungarian + LS finds the best solution in about 3 secs.; k-members + LS finds the best solution in about 210 secs.; and l-greedy + LS finds the best solution in about 190 secs. Interestingly, in this particular case, k-members outperforms l-greedy; however, l-greedy + LS outperforms k-members + LS.

Fig. 5 Time vs. solution quality for the IPUMS dataset with 1,000 tuples

Fig. 6 Improvement for the IPUMS dataset and the GCP metric - Original algorithms vs. Improvement with LS

Figure 6 evaluates the scalability of the algorithm with the IPUMS dataset, the GCP metric, up to 40,000 tuples, and k=3. In this figure, we focus on the improvements of LS, i.e., k-members vs. LS + k-members and l-greedy vs. LS + l-greedy. We note that the Hungarian algorithm is unable to handle these datasets within reasonable times. The RPI for the k-members algorithm goes from 5.2% (1,000 tuples) to 8.5% (40,000 tuples). Similarly, the improvement for the l-greedy algorithm goes from 7.2% (1,000 tuples) to 6% (40,000 tuples).

Fig. 7 Performance improvement for the GCP metric with 10,000 tuples

Figure 7 shows the performance of the algorithm with the three reference datasets as the privacy level increases from k=3 to k=20. We recall that the probability of re-identification decreases as we increase the value of k; thus, the probability of re-identification decreases from 0.33 (k=3) to 0.05 (k=20). As can be seen, the relative performance improvement for the IPUMS dataset ranges from 11.4% and 5.6% (k=3) to 9.7% and 0.4% (k=20) for the k-members and l-greedy algorithms, respectively.

We now switch our attention to the SIL information loss metric. We remark that most of the algorithms have been extensively tuned for a single information loss metric; e.g., [7, 10] focus on the GCP metric and [6] focusses on the SIL metric. For instance, [10] reorders the dataset to favour the GCP metric; however, the impact of this reordering on alternative metrics is currently unknown.

In a similar manner as before, Fig. 8 evaluates the performance improvement on the SIL metric as we increase the size of the dataset. In this case, the improvement of the LS algorithm remains above 30% in all the experiments with both starting solutions (i.e., with l-greedy and k-members). In particular, the RPI value goes from 40% to 43.7% for l-greedy and from 32.5% to 54.4% for k-members. Figure 9 shows that the improvement remains nearly stable as we increase the k-value; interestingly, the RPI goes above 40% for k=5 and k=10 for both algorithms on the IPUMS dataset.

We also note that the performance improvement is similar for both l-greedy and k-members, as both algorithms produce nearly the same solutions with the SIL metric on the Adult and Youtube datasets. We attribute this to the fact that the dedicated sorting function of the l-greedy algorithm has little to no impact on the SIL metric for these datasets.

Fig. 8 RPI for SIL metric and the IPUMS dataset - Original algorithms vs. Improvement of the solutions with LS

Fig. 9 Performance improvement for the SIL metric with 10,000 tuples

6.3 Discussion

The results of our evaluation reveal several key insights into the performance of the tested k-anonymization algorithms. The LS algorithm consistently outperforms the k-member, l-greedy, and Hungarian heuristics. As pointed out in the literature, the Hungarian algorithm, while providing better solutions than k-members and l-greedy, has a significant runtime overhead, making it less practical for larger datasets.

Specifically, our LS algorithm is up to 4.7% better than the Hungarian heuristic on small-size instances with 1,000 tuples and the GCP information loss metric. Similarly, our algorithm is more than 12% better than k-members and l-greedy with 20,000 tuples and the GCP information loss metric. We note that these algorithms have been extensively tuned for the GCP metric. On the SIL metric, our algorithm is up to 54% better than k-members in our experiments with 20,000 tuples.

The Split & Carry algorithm shows potential by outperforming l-greedy and k-members in certain instances, but its limitations become evident with increasing k-values, where it fails to find feasible solutions within reasonable times. This highlights the robustness of the LS-enhanced heuristics, particularly LS + l-greedy and LS + k-members, which not only outperform Split & Carry in solution quality but also demonstrate better runtime performances.

The scalability of the LS algorithm is another notable aspect. The RPI remains significant as the dataset size increases, demonstrating its ability to handle large datasets. Furthermore, the incremental data structures proposed in this paper greatly improve the performance, as shown in Fig. 4, making it a viable option for practical applications.

Additionally, our empirical evaluation outlines the versatility of the LS algorithm in improving solutions across different datasets and metrics. For example, the improvement in the SIL metric remains consistently high across various dataset sizes and k-values, as depicted in Figs. 8 and 9. This consistent performance across multiple metrics and datasets reinforces the utility of our approach in a variety of scenarios.

7 Conclusions

In this paper, we have presented an efficient local search framework for heterogeneous k-anonymization. We presented a move operator to tackle the problem along with its computational complexity and incremental evaluation of the neighbourhood and the objective function. The effectiveness of our approach is demonstrated by experimenting with widely used datasets from IPUMS and the Kaggle repositories.

Our extensive empirical evaluation shows that our framework outperforms the current state-of-the-art by anonymizing the datasets with the same privacy level and lower information losses. Notably, our algorithm showcased exceptional performance by significantly reducing the amount of information loss by up to 40% (resp. 30%) compared to state-of-the-art algorithms. This was particularly evident in its effectiveness on the IPUMS dataset, which comprised 10,000 tuples with a de-anonymization probability of 20% or k=5 (resp. 5% or k=20).

In the future, we plan to extend our framework with stronger privacy definitions such as l-diversity and t-closeness. Effectively, this means the anonymization algorithm will also take into account the amount of sensitive information disclosed for each group of indistinguishable individuals. Furthermore, we plan to tackle the problem as a multi-objective optimization problem, finding a proper trade-off between multiple definitions of privacy at the same time.