1 Introduction

Outsourcing services and data to external providers has become increasingly attractive with the advent of cloud computing. However, while benefits in terms of scalability, elasticity and cost effectiveness are expected, data outsourcing also induces risks. One of the most important risks is the privacy loss of individuals [1]. The use of Database-as-a-Service (DaaS) offerings of external storage providers (SPs) often implies losing control over the outsourced database, as even trustworthy SPs can be prone to third-party attacks. To minimize the risk of data disclosure, data records can be encrypted by a trusted client before being outsourced. To retrieve records, the trusted client has to fetch the encrypted records and decrypt them. As encrypted records are indistinguishable to the SP, indexes are necessary to efficiently retrieve encrypted records that match certain criteria. These indexes, however, can undermine the privacy of individuals again.

One way to preserve the privacy of individuals in indexes is to anonymize the outsourced indexes. If an index contains the two records (John, Flu) and (John, None), John’s privacy might be considered preserved, as the illness of the individual named John cannot be determined with a probability higher than \(\frac{1}{2}\). This anonymization concept, in which each value assignment that can be used to identify individuals has to map to at least \(\ell \) assignments of sensitive values, is referred to as \(\ell \)-diversity [2]. As \(\ell \)-diversified indexes contain plaintext rather than encrypted records, the SP can evaluate queries for records that match certain criteria.

Existing anonymization approaches for the DaaS setting [3–5] that generate an \(\ell \)-diverse representation of a set of data typically do not address the creation of indexes on the anonymized data. The efficiency of anonymized indexes that contain anonymized plaintext records varies considerably depending on the query, the index content and the anonymity requirements, as we show in this paper. To determine which indexes should be created and used, approaches to assess the efficiency of anonymized index tables for specific queries are necessary.

In this paper, we propose Dividat (\(\ell \)-DIVersIfied DATabases), an approach that makes use of \(\ell \)-diversified indexes to make query execution more efficient. In particular, the contributions of this paper are the following:

  • A performance behavior inspection of \(\ell \)-diversified indexes to show that the efficiency of \(\ell \)-diversified index tables heavily depends on the query, the content of the index and the anonymity requirements.

  • Indexing strategies to determine which anonymized index tables should be created for a given workload and which index tables should be used.

  • Performance models to estimate the performance of anonymized indexes a-priori without actually creating each possible index.

The paper is structured as follows: In Sect. 2 anonymization notions and concepts are introduced and related work is summarized. In Sect. 3 we investigate the influencing factors for the performance of anonymized index tables and show how optimized indexing strategies can be derived from performance models. We validate our models in Sect. 4 and discuss our approach in Sect. 5. Finally, the paper is concluded in Sect. 6.

2 Fundamentals and Related Work

2.1 Anonymity Notions and Techniques

In order to quantify the level of anonymity, several anonymity notions have been proposed [6]. Relevant in the context of this paper is the notion of \(\ell \)-diversity [2]. The general concept of \(\ell \)-diversity is applied for scenarios in which data is organized in a set of records that link identifiers to values of an attribute that is considered to be sensitive. The set of records is \(\ell \)-diverse if for each identifier there are at least \(\ell \) records with \(\ell \) distinct sensitive values.

To deal with the fact that the set of identifying attributes might depend on the attacker (and his/her background knowledge), the formal definition of \(\ell \)-diversity builds on the notion of a quasi-identifier (QID). A QID is a set of attributes that can be used as if it were an identifier because of the attacker’s background knowledge.

All records that have the same QID value form a so-called QID-block. The notion of \(\ell \)-diversity requires that each QID-block has to be linked to at least \(\ell \) different, well-represented sensitive attribute values. Thus, attackers that can link individuals to a QID value via background knowledge can determine an individual’s correct sensitive attribute value only with a probability of \(\frac{1}{\ell }\). Formally, \(\ell \)-diversity is defined as follows [2]: A QID-block is \(\ell \)-diverse if it contains at least \(\ell \) well-represented values for the sensitive attribute S. A table is \(\ell \)-diverse if every QID-block is \(\ell \)-diverse.
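To make the notion concrete, the check for the simplest instantiation of \(\ell \)-diversity (distinct sensitive values per QID-block, rather than the more involved “well-represented” criteria of [2]) can be sketched as follows; the function and attribute names are illustrative, not part of any cited approach:

```python
from collections import defaultdict

def is_l_diverse(records, qid_attrs, sensitive_attr, ell):
    """Distinct l-diversity check: every QID-block must contain at
    least ell distinct values of the sensitive attribute."""
    blocks = defaultdict(set)  # QID value -> set of sensitive values
    for rec in records:
        qid_value = tuple(rec[a] for a in qid_attrs)
        blocks[qid_value].add(rec[sensitive_attr])
    return all(len(values) >= ell for values in blocks.values())

table = [
    {"Name": "John", "Illness": "Flu"},
    {"Name": "John", "Illness": "None"},
    {"Name": "Eve",  "Illness": "Headache"},
    {"Name": "Eve",  "Illness": "Flu"},
]
print(is_l_diverse(table, ["Name"], "Illness", 2))  # True
print(is_l_diverse(table, ["Name"], "Illness", 3))  # False
```

Each QID-block here contains two distinct illnesses, so an attacker who knows a name guesses the correct illness with probability at most \(\frac{1}{2}\).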

To anonymize datasets to adhere to privacy notions like \(\ell \)-diversity, a variety of approaches such as anatomization [4], generalization [8], permutation [9], perturbation [10, 11] and suppression [8] exist. For the DaaS use-case where data is repeatedly updated and queried for records that satisfy specific conditions, the concept of anatomization is the prevalent anonymization method [3–5].

Table 1. Anatomization example. QID: {Name, ZIP}, sensitive attributes: {ZIP, Illness}

The focus of this paper is on optimizing indexes for anatomized data rather than on presenting yet another anonymization approach. Therefore, we build on existing approaches that ensure that the SP only has an anatomized view on the data [3–5]: All sensitive attributes that are not contained in the QID (SensOnly in the following) are stored in one table, each sensitive attribute that is contained in the QID (QidSens) is stored in a separate table, and the remaining attributes (QidOnly) are stored in another table. Records are anonymized in batches of \(\ell \) records. Each batch is assigned a group identifier (GID) so as not to lose the relations of the attribute values entirely. Based on the GID, each attribute that is contained in the QID maps to at least \(\ell \) different sensitive values. Thus, the anatomized data representation can be considered \(\ell \)-diverse.

Example: The relation shown in Table 1a can be anatomized as shown in Table 1c to guarantee 2-diversity with a QID {Name, ZIP} and sensitive attributes {ZIP, Illness}. For instance, the QID attribute combination “John, 12345” can be mapped onto 2 different values for the sensitive attribute Illness. However, information is lost by anatomization: A query for the ZIP code of the records with the name “John” would return “12345” and “12349”.
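A minimal sketch of this anatomization step, assuming batches of exactly \(\ell \) records and ignoring how leftover records of an incomplete batch are handled (existing approaches buffer those in an encrypted table); all function and parameter names are our own illustration:

```python
def anatomize(records, qid_only, qid_sens, sens_only, ell):
    """Split records into anatomized tables linked only by a group
    identifier (GID). Within a batch of ell records, the exact link
    between QID values and sensitive values is hidden."""
    qid_only_tbl, sens_only_tbl = [], []
    qid_sens_tbls = {a: [] for a in qid_sens}  # one table per QidSens attribute
    n_batches = len(records) // ell            # incomplete batches are dropped here
    for b in range(n_batches):
        gid = b + 1
        for rec in records[b * ell:(b + 1) * ell]:
            qid_only_tbl.append({"GID": gid, **{a: rec[a] for a in qid_only}})
            sens_only_tbl.append({"GID": gid, **{a: rec[a] for a in sens_only}})
            for a in qid_sens:
                qid_sens_tbls[a].append({"GID": gid, a: rec[a]})
    return qid_only_tbl, qid_sens_tbls, sens_only_tbl

# The two records of GID 1 from Table 1, with ell = 2:
records = [
    {"Name": "John", "ZIP": 12345, "Illness": "Flu"},
    {"Name": "Eve",  "ZIP": 12349, "Illness": "Headache"},
]
names, zips, illnesses = anatomize(records, ["Name"], ["ZIP"], ["Illness"], ell=2)
```

After this split, the SP can only relate a Name to a ZIP or Illness via the GID, i.e., with \(\ell \) candidates per value.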

Approaches that provide anonymized DaaS remedy this loss of information by outsourcing encrypted versions of the records along with the anatomized data [4, 5]. In the previous example, if the attribute values of GID 1 match a query, the encrypted records Enc(John, 12345, Flu) and Enc(Eve, 12349, Headache) can be retrieved from Table 1b and sent back to the trusted client of the query issuer. The trusted client then has to decrypt them and potentially discard false positive records that do not match the original query.

2.2 Related Approaches

The problem of confidential database outsourcing was addressed in the DaaS community by using encryption [12], confidentiality preserving indexing structures [13–15] and partitioning of attributes across multiple SPs [16–18]. While encrypting a database preserves data confidentiality and therefore the anonymity of individuals, indexing structures are needed for efficient query execution. Confidentiality preserving indexing structures can be used to query outsourced data efficiently, but can also induce information flows that allow attackers with a certain background knowledge to reveal sensitive attributes [19].

Privacy preserving data publishing aims at preserving the privacy of individuals by anonymizing the data before publishing it. Traditional proposals for anonymized data publishing [6] focus on anonymizing and publishing a batch of data once. Recent approaches [20–23] also address the problem of publishing multiple versions of a dataset or even data streams, allowing insert, delete and update operations to be performed on a published dataset.

Nergiz et al. [3, 4] propose to leverage the findings of the data publishing community in DaaS scenarios. Their concept builds on anatomization of a data table into QID and sensitive attributes. By storing newly inserted records in an encrypted table that is completely transferred to and searched by a trusted client when evaluating queries, Nergiz et al. can guarantee that queries return newly inserted records. Vimercati et al. [5] propose to fragment databases and only loosely couple the fragments by assigning each attribute value to a group and mapping the groups of different attributes onto each other. Unlike Nergiz et al. [3, 4], they support sensitive attributes that are contained in the QID. They showed that the anonymity properties of such anatomized databases are comparable to the \(\ell \)-diversity notion. Dividat extends the approaches of Nergiz et al. and Vimercati et al. by examining in which cases it is beneficial for query execution performance to build pre-joined index tables based on the anatomized data.

Query optimization approaches for classical, non-anonymized databases [24–26] aim at finding an optimized query execution plan that minimizes the induced cost. Other approaches make use of materialized views to efficiently execute queries that would otherwise require expensive joins and projections [27, 28]. In classical query optimization, the query has to be evaluated as efficiently as possible and return correct results. Correctness is not a necessary condition in our case, as a trusted client post-processes the results: with Dividat, the SP can partially evaluate a query and let the trusted client evaluate the rest if this is expected to be more efficient. Our approach also takes into account the \(\ell \)-diversified data structure that originates from anatomization to a-priori determine which materialized views should be created before the data is even present. However, Dividat builds on classic approaches in the sense that it leverages the query optimization capabilities of the SP’s database management system.

3 Performance-Optimized Indexes

Executing queries based on anatomized data as shown in Table 1c requires to join tables. Especially for big datasets, joining is considered expensive in terms of performance [29]. To avoid joins, materialized index tables that contain all attributes that are relevant to execute a query can be built by pre-joining the anatomized data. Such materialized index tables from anatomized data can increase query execution performance. However, as we will show in Sect. 3.1 a materialized index table that contains all relevant attributes for a query is not always the most efficient choice and choosing the optimal index table for a query can have a large performance benefit. In Sect. 3.2, we show how to develop an optimized indexing strategy. To decide which index tables should be created by the SP and which of the created index tables should be used to execute a given query, the overheads induced by different index tables need to be quantified without actually executing the query. We propose models that can be used to quantify these overheads in Sect. 3.3.

3.1 Possible Performance Savings

To investigate the potential performance benefits of optimizing the usage of anonymized indexes, we conducted empirical measurements. All measurements presented in this section are based on a PostgreSQL database run on a machine with 4 GB RAM and a 2.50 GHz quad-core CPU. The data on which we executed queries contained 30,000 records with three attributes, for which values were chosen uniformly and independently at random: gender with two possible values, and synth1 and synth2 with a varying value domain that depends on the measured scenario. The anonymity requirements were set to \(QidOnly=\{gender\}\), \(QidSens=\{synth1\}\) and \(SensOnly=\{synth2\}\). We can show that optimizing the usage of index tables can improve performance even in very simple scenarios such as the investigated one.

Definition (Single Selection): A query performs a single selection on an attribute iff it selects only records that contain a single specific value for this attribute. Example: The query SELECT * ... WHERE ZIP=12345 AND Phone=34622 AND Illness=None performs single selections on ZIP, Phone and Illness.    \(\Box \)

Definition (Multi Selection): A query performs a multi selection on an attribute iff it selects only records that contain one out of a set of values the attribute can take. Example: The query SELECT * ... WHERE ZIP=12345 AND Phone \(<\) 35638 AND (Illness=None OR Illness=Flu) performs multi selections on Phone and Illness.    \(\Box \)

Fig. 1. Performance differences between index tables with varying parameters.

First, we investigate the dependency of the query latency on the number of queried values. We measured the execution time of random queries that contained single selections on gender and multi selections (based on \(\le \) and \(\ge \) operators) selecting a varying number of values for both synth1 and synth2, based on different index tables. The results are shown in Fig. 1a. Executing the queries on a non-anonymized plaintext table provides the best results, but does not enforce the privacy requirements. Executing the queries based on a 3-diversified index table that contained gender and synth1 (gender-synth1 in Fig. 1a) performs best if fewer than 7 values are selected each for synth1 and synth2. Otherwise, it is most efficient to evaluate the query on an index table that does not contain synth1 and synth2 but just gender. The measurements show that the optimal index table choice heavily depends on the query, and performance differences of more than a factor of 5 exist between index tables (see gender vs. gender-synth1-synth2 for 16 selected values each for synth1 and synth2 in Fig. 1a).

Second, we investigate the dependency of the query latency on the data’s structure. We measured the latency of random queries on varying datasets that differed in the number of values in the domain of the attributes synth1 and synth2. The results are shown in Fig. 1b. With a larger value domain for synth1/synth2, the latencies decrease as fewer records match the executed queries. The measurements show that the optimal index table choice for the executed query heavily depends on the value domain of synth1 and synth2. For instance, executing the query based on the anonymized index table containing gender, synth1 and synth2 (gender-synth1-synth2 in Fig. 1b) is the worst choice if synth1’s and synth2’s value domains include 5 values each, but is the best choice for value domains that contain 15 values each or more.

Third, we investigate the dependency of the query latency on the diversity factor \(\ell \). We measured the execution time of random queries that just contained a single selection on each attribute (e.g., ...WHERE gender=male AND synth1=1 AND synth2=2). The measured query latencies for a varying value domain of both synth1 and synth2 are shown in Fig. 1c. The measurements show that the optimal index table choice depends on synth1’s and synth2’s value domain and the anonymization factor \(\ell \). It is more efficient to use the anonymized index table gender-synth1-synth2 than the gender index table if \(\ell =5\) and synth1’s as well as synth2’s value domain contains 10 values each. However, if \(\ell =10\), it is more efficient to execute the query based on the gender index table.

The measurements of the exemplary scenario presented in this section show that the usefulness of anonymized index tables for executing queries more efficiently differs heavily, and that an anonymized index table containing all attributes of a given query is not necessarily the best choice. We provide the rationale for this behavior in Sect. 3.3.

3.2 Index Optimization Strategies

The conducted measurements show that the efficiency of \(\ell \)-diverse index tables heavily depends on (a) the data’s structure, (b) the \(\ell \)-diversity requirements and (c) the executed queries. Thus, to optimize the usage of indexes, information on these influencing factors is necessary. While the \(\ell \)-diversity requirements are well specified and statistics can be kept by the SP to derive the data’s structure, the queries executed in the future are in general not predictable. In this section, we propose strategies on how to adaptively determine which index tables should be used to execute a specific query and which index tables should be created.

Query Execution Strategy: As the optimal index table to execute a query depends on the query itself, for each query \(q\) an optimized decision has to be made on which existing index table \(I\) induces the lowest query latency. To make this decision, the expected query latency \(l(I,q)\) needs to be estimated for each index table \(I\), and the index table with the minimum expected latency needs to be chosen. As this decision has to be made on-demand upon incoming queries, the latency estimation has to be performed efficiently; actually executing the query to determine each index table’s efficiency is not an option. Thus, a model is needed that allows the expected query latency \(l(I,q)\) to be assessed a-priori for each index table. We propose such a model in Sect. 3.3.

Index Creation Strategy: The SP stores anonymized records in anatomized tables that allow to create index tables on-demand without interacting with the trusted client. The choice which set of index tables can be considered optimal and should be created does not necessarily depend only on the expected query workload \(\mathcal {Q}\) but also on the storage overhead induced by the index tables. Both the storage overhead \(N_I\) of a single index table \(I\) and the query latency \(l(I, q)\) that is induced when executing a query \(q\) on index table \(I\) can be estimated based on models we propose in Sect. 3.3. Based on these models, the query latency induced by a given set of index tables \(\mathcal {I}\) for a given workload \(\mathcal {Q}\) can be calculated as \(\sum _{q\in \mathcal {Q}}\min _{I \in \mathcal {I}}l(I,q)\). Furthermore, the storage overhead of a set of index tables \(\mathcal {I}\) can be calculated as \(\sum _{I\in \mathcal {I}}N_I\). Both metrics can be considered by applying a weight \(w_l\) to the latency overhead and a weight \(w_s\) to the storage overhead according to the user’s individual preferences:

$$\begin{aligned} o(\mathcal {I}, \mathcal {Q}) = w_l\cdot \frac{1}{|\mathcal {Q}|}\sum _{q\in \mathcal {Q}}\min _{I \in \mathcal {I}}l(I,q) + w_s\cdot \sum _{I\in \mathcal {I}}N_I \end{aligned}$$
(1)
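Eq. 1 translates directly into code. The sketch below assumes hypothetical latency and storage callables standing in for the models of Sect. 3.3; the toy numbers are illustrative only:

```python
def overhead(index_tables, workload, latency, storage, w_l, w_s):
    """Eq. 1: weighted sum of the average best-case query latency over
    the workload and the total storage overhead of the index tables."""
    avg_latency = sum(min(latency(I, q) for I in index_tables)
                      for q in workload) / len(workload)
    total_storage = sum(storage(I) for I in index_tables)
    return w_l * avg_latency + w_s * total_storage

# Toy cost functions for two index tables "A", "B" and two queries:
lat = {("A", "q1"): 5.0, ("B", "q1"): 2.0, ("A", "q2"): 1.0, ("B", "q2"): 4.0}
size = {"A": 100, "B": 200}
o = overhead(["A", "B"], ["q1", "q2"],
             latency=lambda I, q: lat[(I, q)],
             storage=lambda I: size[I],
             w_l=1.0, w_s=0.01)  # = 1.0 * 1.5 + 0.01 * 300 = 4.5
```

Note that each query is charged only the latency of the best index table for it, matching the \(\min \) in Eq. 1.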

To find a set of index tables \(\mathcal {I}\) that minimizes the overhead \(o(\mathcal {I}, \mathcal {Q})\), the query workload \(\mathcal {Q}\) that will be executed needs to be known. We propose two options to obtain this workload:

Option 1 - User interaction: One way to address the problem is to let the user specify the queries that need to be efficiently executable and for which index tables should be created. The user can state precise queries that should be executable as efficiently as possible (e.g., ...WHERE 2 \(<\) age \(<\) 11 AND name = john). However, in reality it is often not possible for the user to specify the precise queries that will be executed in the future. In this case, it suffices if the user can specify abstract queries, i.e., the expected average number of selected values for each attribute \(|q_a|\) for each query (e.g., \(|q_{age}|=8, |q_{name}|=1\)).

We propose an efficient algorithm that runs in \(O(|A_q|)\) (where \(A_q\) is the set of relevant attributes for query \(q\)) to determine the index table \(I\) that minimizes the overhead function \(o\) for a single query \(q\) in [30]. SPs can use the algorithm to determine the optimal index table for each user specified query and create the tables from the anatomized view.

Option 2 - Dynamic Self-Optimization: In some cases it might be hard for the user to specify beforehand the average number of selected values per attribute for a query or even to specify the kind of queries to be executed. Another way to address the problem is monitoring the query workload and adapting the indexing strategy to it. To characterize the workload, a sliding window of the last \(x\) queries can be taken as reference for future queries. Based on these queries, a set of index tables that minimizes the overhead function in Eq. 1 can be determined. To limit the storage overhead and to avoid excessive performance loss for inserting, updating and deleting records in all index tables, the number of index tables should be limited to a certain number. The resulting optimization problem can be solved by the SP in regular intervals to check whether the indexing strategy should be adapted. While solving this optimization problem is out of scope of this paper, we provide a formal description of the optimization problem in [30] and leave a deeper inspection of the problem for future work.
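As an illustration only (the paper leaves solving the actual optimization problem to future work), a sliding window combined with a brute-force search over small candidate index-table sets could look as follows; the class and parameter names are hypothetical:

```python
from collections import deque
from itertools import combinations

class IndexSelfOptimizer:
    """Keep the last `window` queries as the reference workload and pick
    the set of at most `max_tables` index tables that minimizes the
    overhead function (Eq. 1). Exhaustive search over subsets; only
    feasible for a small number of candidate tables."""

    def __init__(self, candidates, overhead_fn, window=1000, max_tables=3):
        self.candidates = candidates    # index tables that could be built
        self.overhead_fn = overhead_fn  # o(index_set, workload), cf. Eq. 1
        self.queries = deque(maxlen=window)
        self.max_tables = max_tables

    def record_query(self, q):
        self.queries.append(q)

    def best_index_set(self):
        workload = list(self.queries)
        best, best_cost = None, float("inf")
        for k in range(1, self.max_tables + 1):
            for subset in combinations(self.candidates, k):
                cost = self.overhead_fn(subset, workload)
                if cost < best_cost:
                    best, best_cost = subset, cost
        return best
```

The SP would call `best_index_set` in regular intervals and rebuild index tables only when the chosen set changes.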

3.3 Performance Models

To evaluate Eq. 1, one needs to be able to determine the storage overhead \(N_I\) of an index table and the query latency \(l(I,q)\) for executing a query based on an index table. In this section, we propose models that capture the trade-off between \(\ell \)-diversity requirements and induced storage as well as query latency overheads. We propose a storage overhead model to a-priori quantify the storage overhead \(N_I\) a given index table induces and a query latency model that estimates the execution time \(l(I,q)\) of a query \(q\) on a given index table \(I\).

Storage Overhead Model (SO-model): By multiplying the number of index table records with the storage space needed to store a single record, the storage overhead induced by an index table \(I\) can be exactly calculated. For an index table \(I\) that contains the set of attributes \(A_I\) the number of records \(N_I\) that are contained in the index table can be calculated as:

$$\begin{aligned} N_I = \ell ^{|QidSens\bigcap A_I|+\delta (|SensOnly\bigcap A_I|)+\delta (|QidOnly\bigcap A_I|)} \cdot N_M \end{aligned}$$
(2)

where \(\delta (x)=\begin{cases} 1 & \text {if } x > 0\\ 0 & \text {otherwise}\end{cases}\) and \(N_M\) is the number of outsourced records.
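A direct transcription of Eq. 2; a sketch assuming the attribute sets are given as Python sets (the parameter names are ours):

```python
def num_index_records(index_attrs, qid_sens, sens_only, qid_only, ell, n_m):
    """Eq. 2: number of records N_I in an index table over the
    attribute set index_attrs, built from the anatomized view."""
    delta = lambda x: 1 if x > 0 else 0
    exponent = (len(qid_sens & index_attrs)          # one factor ell per QidSens attr
                + delta(len(sens_only & index_attrs))  # one factor ell if any SensOnly attr
                + delta(len(qid_only & index_attrs)))  # one factor ell if any QidOnly attr
    return ell ** exponent * n_m

# Setup of Sect. 3.1: QidOnly = {gender}, QidSens = {synth1}, SensOnly = {synth2}
n = num_index_records({"gender", "synth1", "synth2"},
                      qid_sens={"synth1"}, sens_only={"synth2"},
                      qid_only={"gender"}, ell=3, n_m=30000)  # 3**3 * 30000
```

For the full gender-synth1-synth2 index table with \(\ell =3\), this yields 27 times the number of outsourced records, which illustrates why larger index tables are not automatically the better choice.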

Index table \(I\) can be built by joining, on the GID, all tables of the anatomized view that contain attributes included in \(A_I\). For each sensitive attribute that is contained in both the QID and the index table (\(QidSens\bigcap A_I\)), the attribute’s QidSens table in the anatomized view, which contains \(\ell \) records for each GID, has to be joined. Each such table increases the number of entries in the result of the join by a factor of \(\ell \). Furthermore, if a sensitive attribute that is not part of the QID is contained in \(A_I\), i.e., \(SensOnly\bigcap A_I\) is not empty, the table containing the sensitive attributes has to be joined as well. As this table contains \(\ell \) records for each GID, this increases the result of the join by a factor of \(\ell \).
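The join itself can be sketched with a naive nested-loop join on the GID (in practice the SP would let its DBMS materialize the view); function names are illustrative:

```python
def join_on_gid(left, right):
    """Naive nested-loop join of two anatomized tables on their GID."""
    return [{**l_rec, **r_rec}
            for l_rec in left
            for r_rec in right
            if l_rec["GID"] == r_rec["GID"]]

def build_index_table(anatomized_tables):
    """Materialize an index table by joining all anatomized tables
    that contain attributes of the index."""
    result = anatomized_tables[0]
    for tbl in anatomized_tables[1:]:
        result = join_on_gid(result, tbl)
    return result

# Reusing the anatomized Name and ZIP tables of Table 1 (GID 1 only):
names = [{"GID": 1, "Name": "John"}, {"GID": 1, "Name": "Eve"}]
zips = [{"GID": 1, "ZIP": 12345}, {"GID": 1, "ZIP": 12349}]
idx = build_index_table([names, zips])  # 2 * 2 = 4 records: the ell-fold blow-up
```

The 4 resulting records for 2 original ones show the factor-\(\ell \) growth described above.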

Query Latency Model (QL-model): The query latency, i.e., the time needed to retrieve the matching records from an index table, depends on many factors like the used database system, the disk latency and the hardware configuration. We abstract from most of them by making the following assumption.

Assumption I: The query latency \(l(I, q)\) of executing query \(q\) on index table \(I\) linearly depends on the number of records \(r(I, q)\) that are contained in the index table and match query \(q\).

$$\begin{aligned} l(I, q) = {c\cdot r(I, q) + f} \end{aligned}$$
(3)

We show in Sect. 4.1 that Assumption I holds.

To apply the QL-model shown in Eq. 3, the parameters \(c\), \(f\) and \(r(I, q)\) have to be determined. Below, we propose an analytical model (the MR-model) that can be used to estimate the number of records \(r(I, q)\) in an index table \(I\) that match query \(q\). Modeling \(c\) and \(f\) analytically like \(r(I, q)\) is hard, as they depend on a variety of factors (e.g., indexing structures, disk latency, caching effects, database implementation, etc.). To determine the parameters for an existing index table \(I\), execution times (\(e_1\),\(e_2\),...,\(e_n\)) of multiple queries \(q_1\), \(q_2\),..., \(q_n\) that return a different number of records (\(r(I, q_1)\ne r(I, q_2)\)) can be measured. Simple linear regression can then be used to determine the parameters \(c\) and \(f\) that provide the best fit for the following system of equations: \(e_i = c\cdot r(I, q_i)+f, i=1\ldots n\).
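This parameterization step can be sketched as a plain least-squares fit (equivalent to simple linear regression; `numpy.polyfit` would do the same); the measurement values below are toy data:

```python
def fit_latency_params(match_counts, latencies):
    """Least-squares fit of l(I, q) = c * r(I, q) + f (Eq. 3) from
    measured execution times on an existing index table."""
    n = len(match_counts)
    mean_r = sum(match_counts) / n
    mean_e = sum(latencies) / n
    cov = sum((r - mean_r) * (e - mean_e)
              for r, e in zip(match_counts, latencies))
    var = sum((r - mean_r) ** 2 for r in match_counts)
    c = cov / var            # slope: cost per matching record
    f = mean_e - c * mean_r  # intercept: fixed per-query cost
    return c, f

# Perfectly linear toy measurements: e = 3 * r + 5
c, f = fit_latency_params([10, 20, 30], [35.0, 65.0, 95.0])
```

Real measurements are noisy, so the fit yields the best linear approximation rather than an exact solution of the system.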

The resulting parameterized model can be used to determine a query execution strategy for incoming queries. To determine an index creation strategy, the parameters cannot be determined for each potential index table in most cases as this would imply to actually build each index table.

Assumption II: The parameters \(c\) and \(f\) are independent of the data’s structure, i.e., they only depend on the size of the index table \(N_I\).

Based on Assumption II, the parameters for non-existing index tables can be approximated by determining \(c\) and \(f\) for each possible index table size \(N_I\), building one according index table per size and measuring the query latency. The value domain of \(N_I\) is small, as \(N_I\) directly depends on the number of index table attributes in SensOnly, QidOnly and QidSens (see Eq. 2). The calculated \(c\), \(f\) parameter values can then be reused for all index tables with \(N_I\) entries without having to build each of them. In Sect. 4.1 we show that Assumption II holds.

Matching Records Model (MR-model): The query latency to evaluate a query \(q\) on an index table \(I\) depends on the number of records \(r(I, q)\) contained in the index table that match the query. In the following we will provide analytical models that can be used by the QL-model to a-priori estimate the number of returned records \(r(I, q)\) of a query \(q\) that is evaluated based on a given index table \(I\). For simplicity, we initially make the following assumption.

Assumption III: The attributes’ values are independently distributed. We discuss in Sect. 4.2 in which cases this assumption is actually necessary.

For simplicity and without loss of generality, we presume that queries have a specific form onto which every query can be mapped: for each contained attribute, the query filter contains conditions that are concatenated with OR operators. Conditions on distinct attributes are concatenated with AND operators.

Multi Selection Model: Consider a set of data with \(N_M\) records in which all attribute values are uniformly and independently distributed, and a required diversity of \(\ell \). The expected number of matching records for a query \(q\) that is evaluated on an index table \(I\) can be calculated as:

$$\begin{aligned} r(I, q)={\prod _{a\in A_I}\frac{|q_a|}{|V_a|}\cdot N_I} \end{aligned}$$
(4)

where \(V_a\) is the set of possible values for attribute \(a\), i.e., \(|V_a|\) denotes the cardinality of attribute \(a\) that is contained in statistics maintained by the SP. The variable \(q_a\) denotes the set of values the query selects for attribute \(a\). Selectivity is the percentage of records that contain specific attribute values. As attribute \(a\) is uniformly distributed, the average selectivity of \(a\) amounts to \(\frac{|q_a|}{|V_a|}\). Furthermore, as the attribute values are assumed to be independently distributed, the combined selectivity of multiple attributes concatenated with AND clauses can be calculated by multiplying the selectivities of the single attributes. Multiplying this “combined selectivity” with the number of records in the anonymized index table results in the estimated number of records that are selected by a query.
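A sketch of Eq. 4, with the query given as a mapping from attribute name to the set of selected values (names and figures are illustrative):

```python
def matching_records_uniform(query, cardinalities, n_i):
    """Eq. 4: expected number of index-table records matching query q,
    assuming uniformly and independently distributed attribute values.
    query: {attribute: set of selected values}
    cardinalities: {attribute: |V_a|}; n_i: index table size N_I."""
    selectivity = 1.0
    for attr, selected in query.items():
        # Attributes of A_I without conditions have selectivity
        # |V_a| / |V_a| = 1 and can simply be omitted from `query`.
        selectivity *= len(selected) / cardinalities[attr]
    return selectivity * n_i

# A single selection on gender and a 2-value multi selection on synth1:
r = matching_records_uniform({"gender": {"male"}, "synth1": {1, 2}},
                             {"gender": 2, "synth1": 10}, n_i=1000)  # ~100
```

Here the combined selectivity is \(\frac{1}{2}\cdot \frac{2}{10}=0.1\), so roughly a tenth of the index table is expected to match.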

The assumption of uniformly distributed attribute values does not hold for real use-cases; for non-uniformly distributed attribute values, the model in Eq. 4 produces inaccurate predictions. To account for non-uniformly distributed attribute values, the frequency of attribute values can be maintained in statistics kept by the SP. In the following, we denote the occurrence probability of value \(v\) for attribute \(a\) as \(p_a(v)\).

Non-uniform Multi Selection Model: Based on the occurrence probabilities, the expected number of matching records for a query \(q\) that is evaluated on an index table \(I\) can be calculated as follows:

$$\begin{aligned} {r(I,q)=\prod _{a\in A_I}\sum _{v \in q_a}p_a(v) \cdot N_I} \end{aligned}$$
(5)

The selectivity of a multi selection amounts to \(\sum _{v \in q_a}p_a(v)\) as it suffices for a record to contain any attribute value that is selected in the multi selection. This corresponds to the OR clauses that concatenate conditions on the same attribute.
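Eq. 5 replaces the uniform per-value selectivity with summed occurrence probabilities taken from the SP’s statistics; a sketch with illustrative probabilities:

```python
def matching_records(query, value_probs, n_i):
    """Eq. 5: expected number of matching records for non-uniformly
    distributed attribute values.
    query: {attribute: set of selected values}
    value_probs: {attribute: {value: p_a(v)}}; n_i: index table size N_I."""
    selectivity = 1.0
    for attr, selected in query.items():
        # OR-concatenated conditions on one attribute: probabilities add up.
        selectivity *= sum(value_probs[attr][v] for v in selected)
    return selectivity * n_i

# Illness = Flu OR Illness = None, with p(Flu) = 0.3 and p(None) = 0.5:
r = matching_records({"Illness": {"Flu", "None"}},
                     {"Illness": {"Flu": 0.3, "None": 0.5}}, n_i=100)  # ~80
```

With uniform probabilities \(p_a(v)=\frac{1}{|V_a|}\), this reduces exactly to Eq. 4.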

4 Evaluation

The SO-model (Eq. 2) has been analytically derived and accurately calculates the number of records that are contained in an index table that was built from an anatomized view. We validate the QL-model in Sect. 4.1 and investigate the applicability of the MR-model for non-uniform and interdependent attribute value distributions in Sect. 4.2.

4.1 QL-model Validation

We measured the latencies of queries that were executed on a variety of index tables and checked whether the execution times are consistent with our models. The measurements were conducted in the same setup described in Sect. 3.1. As exploring the entire parameter space of possible index tables is not feasible, we generated scenarios according to the \(2^k\)-factorial design method [31], i.e., we fixed two choices for each parameter and investigated every possible combination of parameter choices. We chose between 3 and 5 for the \(\ell \)-parameter, and between one or two sensitive attributes that are also contained in the QID or no sensitive attributes at all. Regarding the data structure, we chose between uniformly and irregularly distributed attribute values and fixed the size of synth1/synth2’s value domain to either 20 or 2000. For each configuration, we measured the latencies of 6000 queries that select either 1, 4, 7, 10, 13 or 16 values.

Validating Assumption I: The query latency linearly depends on the number of matching records.

The measured query latency subject to the number of records that matched the query of one configuration is shown in Fig. 2. Except for some outliers (\(\sim \)60 outliers vs. 6000 measurements), the query latency measurements linearly depend on the number of matching records for the investigated configuration and parameterizing the QL-model as described in Sect. 3.3 results in a good fit (cf. the dotted line in Fig. 2). Measurements of the other configurations confirmed this finding (see [30]).

Fig. 2. Linear dependency: number of matching records vs. query latency

Validating Assumption II: The parameters \(c, f\) of the QL-model for a given index table can be reused for other index tables that contain other data, but the same number of records \(N_I\).

Based on our measurements, we parameterized the QL-model for each index table we created from our \(2^k\)-factorial design (see Table 2). The parameters indicate that the data’s structure, i.e., the number of attribute values and their distribution, is not a significant influencing factor for the parameters \(c\) and \(f\). Thus, the parameters can be determined once for an index table and can then be reused for other index tables that contain different data but have the same anonymity requirements and therefore contain the same number of records \(N_I\) without actually building them.

Table 2. The measured \(c\) and \(f\) values for each anonymity parameter choice underline the independence of these parameters from the data structure for each investigated \(2^k\)-factorial design configuration.

To confirm the validity of the general QL-model in combination with the MR-model, we measured the deviation of the real latency from the latency that our parameterized models predicted. We performed these measurements for each configuration of the \(2^k\)-factorial design. The aggregated results are shown in Fig. 3. The x-axis represents the absolute deviation of the QL-model from the real latencies in milliseconds; the y-axis shows the number of queries with the corresponding deviation. The measurements show that the predictions of the QL-model very closely match the actual query latencies. Combined with our result that query latency depends linearly on the number of matching records, this finding also strongly indicates that our matching-records model is sound.
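The aggregation underlying such a deviation histogram can be sketched as bucketing absolute prediction errors; the bucket width below is an arbitrary illustrative choice:

```python
from collections import Counter

def deviation_histogram(predicted, measured, bucket_ms=1.0):
    """Count queries per absolute-deviation bucket (in milliseconds)."""
    hist = Counter()
    for p, m in zip(predicted, measured):
        hist[int(abs(p - m) // bucket_ms)] += 1
    return dict(hist)

# Two queries: one off by 0.4 ms (bucket 0), one off by 2.3 ms (bucket 2).
hist = deviation_histogram([10.0, 10.0], [10.4, 12.3])
```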

Fig. 3. Aggregated deviations: QL-model predictions vs. real measurements (histogram)

4.2 Applicability of the MR-model

We based the analytical models that predict the number of records matching a query in an index table on Assumption III of independent attribute value distributions. In this section, we show in which cases Assumption III is actually necessary to correctly apply the models to derive a query execution strategy and an index creation strategy.

When deriving a query execution strategy for a given query \(q\), the query \(q\) is exactly known and Eq. 5 can be applied. The validation of the QL-model presented in Sect. 4.1 included uniformly as well as irregularly distributed attribute values and showed that the QL-model can accurately predict the query latency in both cases. As query latency depends linearly on the number of matching records (cf. Sect. 4.1), this strongly indicates that our MR-model is sound for independently distributed attribute values. If the attribute value distributions depend on each other, using the proposed models leads to a prediction error whose magnitude depends on the strength of the attribute dependency. Even in the field of traditional query optimization, the assumption of independent attributes is often made and the resulting errors are accepted [25, 26]. Thus, this issue remains a challenging problem for query optimization in general and is left to future work.
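Equation 5 itself is defined earlier in the paper; purely as an illustration of the independence assumption, a textbook selectivity-product estimate (an assumed form on our part, not necessarily identical to Eq. 5) multiplies the per-attribute selectivities of the values a query selects:

```python
def expected_matches(n_records, query, frequencies):
    """Expected number of matching records assuming independent attributes:
    E[m] = N_I * prod over attributes of (summed relative frequencies
    of the selected values)."""
    selectivity = 1.0
    for attr, selected_values in query.items():
        freq = frequencies[attr]  # value -> relative frequency in the index
        selectivity *= sum(freq.get(v, 0.0) for v in selected_values)
    return n_records * selectivity

# Hypothetical index with 1000 records and two attributes.
freqs = {
    "illness": {"Flu": 0.1, "None": 0.3, "Cold": 0.6},
    "zip": {"10115": 0.5, "80331": 0.5},
}
est = expected_matches(1000, {"illness": {"Flu", "None"}, "zip": {"10115"}}, freqs)
# 1000 * (0.1 + 0.3) * 0.5 = 200 expected matches
```

If the attributes are in fact dependent, the product of marginal selectivities over- or underestimates the true joint selectivity, which is exactly the prediction error discussed above.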

Table 3. Correctness of the models for index table creation vs. input information and data characteristics

When determining an index creation strategy, the models are applied on query workloads instead of single queries. In this setting, we can prove that, under certain assumptions, our models are sound even if the attribute value distributions depend on each other. Intuitively, this can be explained by the fact that the prediction errors of our models even out. Table 3 gives an overview of the assumptions that need to be made to apply the MR-models, depending on the attribute value distribution and the information that can be provided.

If the attribute values are uniformly and independently distributed, Eq. 4 can be used to correctly determine the average number of matching records for a set of queries. In this case, it does not matter whether the user specified the expected queries in a precise or an abstract way (cf. Sect. 3.3) or whether the SP adapts the strategy dynamically based on the past workload.

If the attributes are irregularly but independently distributed, Eq. 5 can be applied to correctly determine the average number of matching records if the user specified precise expected queries or the SP adapts the index strategy based on the past workload. If the user only specified abstract queries, Eq. 4 can still be correctly applied if the values selected by future queries are uniformly distributed (cf. Theorem 1). A user who can only specify abstract queries is very likely unable to specify how the values selected by future queries are distributed and may have to assume that they are uniformly distributed.

Theorem 1

Assume that the attribute values of a set of data are not uniformly distributed. Equation 4 can still be used to accurately calculate the expected number of matching records if the values selected by the queries are uniformly distributed.

Proof. See [30].
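The intuition behind Theorem 1 can be checked numerically for a single attribute: when every size-\(s\) subset of the value domain is equally likely to be selected, the average selectivity is \(s/d\) regardless of how skewed the value frequencies are. A small sketch with hypothetical frequencies:

```python
from itertools import combinations

def avg_selectivity(freq, s):
    """Average selectivity over all equally likely size-s value subsets."""
    subsets = list(combinations(freq, s))
    return sum(sum(freq[v] for v in sub) for sub in subsets) / len(subsets)

# Skewed frequencies over a domain of d = 3 values; queries select s = 2 values.
avg = avg_selectivity({"a": 0.5, "b": 0.3, "c": 0.2}, 2)
# avg equals s/d = 2/3, just as for uniformly distributed attribute values
```

Each value appears in the same fraction of subsets, so the frequencies average out and only \(s\) and \(d\) remain, which is why the uniform-case formula stays applicable.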

Furthermore, if it can be assumed that the value combinations selected by future queries are uniformly distributed, it can also be shown that Eq. 4 can be used to correctly calculate the expected number of matching records even if the attributes are not independently distributed (cf. Theorem 2). In this case, Eq. 4 is applicable to minimize the performance overhead for index tables that contain interdependently distributed attribute values.

Theorem 2

Assume that the attribute values of a set of data are not uniformly distributed and/or not independently distributed. Equation 4 can still be used to accurately calculate the expected number of matching records if the value combinations selected by queries are uniformly distributed.

Proof. See [30].
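Analogously, Theorem 2 can be illustrated with dependent attributes: if every size-\(s\) set of value combinations is equally likely to be queried, the average number of matches depends only on \(s\) and the number of possible combinations, not on the joint frequencies. A sketch with a hypothetical, strongly dependent joint distribution:

```python
from itertools import combinations

def avg_matches(n_records, joint_freq, s):
    """Average matching records over all equally likely size-s sets of
    value combinations."""
    subsets = list(combinations(joint_freq, s))
    total = sum(sum(joint_freq[c] for c in sub) for sub in subsets)
    return n_records * total / len(subsets)

# The joint distribution is far from the product of its marginals, yet the
# average equals n_records * s / (#combinations).
joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}
avg = avg_matches(1000, joint, 1)  # 1000 * 1/4 = 250
```

This is the "errors even out" effect mentioned above: per-query predictions may be off, but over a uniformly distributed workload the average is exact.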

5 Discussion, Limitations, Future Work

Functionality of Dividat: We addressed selections of records based on equality and inequality conditions as query constructs in this paper. However, the proposed concepts and models can be applied to most other SQL conditions. For instance, LIKE operators constitute multi selections in our terminology. The parameters \(c\) and \(f\) of the QL-model (cf. Sect. 3.3) have to be determined for each type of multi selection, as they may differ depending on the indexing approach utilized by the underlying database system of the SP.

We considered query latency as one of the main optimization goals in this paper. While this is true for many use cases, applications exist that update the outsourced data at a high frequency. For these applications the performance of INSERT, UPDATE and DELETE operations should be included in the optimization process. Our QL-model can be applied to those operations analogously to SELECT operations. The optimization problem (cf. Eq. 1) to determine which index tables should be created for a query workload can be easily extended to also consider the overhead for these additional operations.
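Such an extension could weight each operation type with its own QL-model parameters; the following sketch (operation types and parameter values are hypothetical) sums the expected latency of a mixed workload:

```python
def workload_latency(ops, params):
    """Expected total latency of a workload; every operation type carries its
    own QL-model parameters (c, f), applied as latency ~ c + f * matches."""
    total = 0.0
    for op_type, matching_records, weight in ops:
        c, f = params[op_type]
        total += weight * (c + f * matching_records)
    return total

# Hypothetical per-operation-type parameters and a small mixed workload
# (weight = relative frequency of the operation in the workload).
params = {"SELECT": (2.0, 0.5), "UPDATE": (3.0, 1.0)}
ops = [("SELECT", 10, 1.0), ("UPDATE", 4, 0.5)]
cost = workload_latency(ops, params)  # 1.0*(2+5) + 0.5*(3+4) = 10.5
```

Minimizing such a combined objective over candidate index table sets would extend the optimization problem of Eq. 1 as described.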

Overheads Induced by Dividat: Our models require statistical data that needs to be maintained and queried to optimize the execution strategy for each query. These processes induce a performance overhead on their own. The problem of processing statistical information in traditional query optimization closely resembles the problem at hand, and existing solutions can be applied in this context as well [24–26].

The choice which index tables should be created has to be made a-priori based on the predicted future data structure and the query behavior. These predictions are not necessarily accurate. Dividat allows the SP to build index tables without help from the trusted client. Thus, once the predicted state differs too much from reality, a new prediction can be made and the indexing strategy can be adapted without inducing any overhead for the trusted client.

The Dividat approach aims to minimize the overhead induced at the SP to evaluate incoming queries based on existing index tables. It considers both latency overhead and storage demand of the index tables. In our future work we plan to extend the models to consider additional factors such as the network transmission overhead and the crypto overhead for the trusted client that is induced by transmitted false-positive records that do not match the query. This will enable the SP to develop indexing strategies to optimize the overall query execution performance from a user’s perspective.

6 Conclusion

In this paper we proposed Dividat, an approach to outsource databases while preserving the \(\ell \)-diversity-based anonymity of individuals. Compared to previous approaches, Dividat enhances query execution performance by optimizing the use of anonymized index tables. Measurements showed that the query latency of naively built anonymized index tables can differ by up to a factor of 5 from the optimal achievable latency. We proposed optimization strategies for index creation and query execution based on models that a-priori estimate the efficiency of \(\ell \)-diversified index tables. The models were shown to accurately predict the efficiency of index tables. The proposed models can be used for index optimization but can also be seen as a starting point for an in-depth analysis to quantify the costs of database \(\ell \)-diversification. While our measurements already gave an impression of the performance cost induced by \(\ell \)-diversification of outsourced databases, a more generic analysis of the question “What is the cost that comes with database \(\ell \)-diversification?” is a matter of future work.