1 Introduction

Record linkage between databases containing information about individuals is common in many medical applications, for example the identification of patient deaths [1], the evaluation of disease treatment [2] and the linkage of cancer registries in epidemiological follow-up studies [3]. In many applications data sets are merged using personal identifiers such as forenames, surnames, place and date of birth. Due to privacy concerns, this has to be done via privacy-preserving record linkage (PPRL). However, since personal identifiers often contain typing or spelling errors, encrypting the identifier values and linking only those that match exactly does not provide satisfactory results. Therefore, to allow for errors in encrypted personal identifiers, many European countries commonly use encrypted phonetic codes, such as Soundex codes, especially in cancer registries. As the performance of these codes is still not satisfactory, several novel privacy-preserving record linkage methods have been suggested in recent years. For example, Schnell et al. [4] developed a method based on Bloom filters. Bloom-filter-based record linkage has already been used in medical applications in a number of different countries [5–8].

Another frequently applied privacy-preserving record linkage method uses anonymous linking codes [9]. The basic principle of an anonymous linking code is to standardize all identifiers of a record (removal of certain characters and diacritics, conversion to upper case), to concatenate them into a single string and finally to pass this string through a cryptographic hash function. By combining this principle with Bloom filters, Schnell et al. [10] developed a novel error-tolerant anonymous linking code, called the Cryptographic Longterm Key (CLK). Instead of encrypting every single identifier of a record through a separate Bloom filter, multiple identifiers are stored in one single Bloom filter, the CLK. Tests on several databases showed that CLKs yield good linkage properties, superior to well-known anonymous linking codes [10].

Recently, Randall et al. [7] presented a study on 26 million records of hospital admissions data and showed that privacy-preserving record linkage with Bloom filters built from multiple identifiers is applicable to large real-world databases without loss of linkage quality.

However, little research on the security of Bloom filters built from more than one identifier has been published so far (see Subsect. 2.2). In several countries, this lack of research prevents the widespread use of Bloom filter encryptions for real-world medical databases (such as cancer registries) where the anonymity of the individuals has to be guaranteed. For example, in its Beyond 2011 Programme the British Office for National Statistics investigated several methods for linking sensitive data sets [11]. The investigators came to the conclusion that none of the '(...) recent innovations, such as bloom filter encryption (...)' can be recommended because they '(...) have not been fully explored from an accreditation perspective'. Thus, research showing drawbacks of the recent Bloom filter techniques is important because it guides the direction of future research and might motivate further development of the recent procedures.

In this paper, we intend to investigate this issue in detail by giving the first convincing cryptanalysis of Bloom filter encryptions built from more than one identifier.

2 Background

In 1970, Bloom [12] introduced a novel approach that permits the efficient testing of set membership through a probabilistic space-efficient data structure. A Bloom filter is a bit array of length L, which is initialized with zeros only. Let \(S \subseteq \mathcal {U}\) be a subset of a universe \(\mathcal {U}\). Then S can be stored in a Bloom filter \(\mathcal {B} = \mathcal {B}(S) = (b_{0} , \dots , b_{L-1})\) in the following way: Each element \(s \in S\) is mapped via k different hash functions \(h_{0}, \dots , h_{k-1}: S \longrightarrow \{ 0, \dots , L-1 \}\) and all the corresponding bit positions \(b_{h_{0}(s)}, \dots , b_{h_{k-1}(s)}\) are set to one. Once a bit position is set to one by an element \(s \in S\), this value no longer changes.

In order to test whether an item \(u \in \mathcal {U}\) from the universe is contained in S, u is hashed through the k hash functions \(h_{0}, \dots , h_{k-1}\) as well. If all bit positions \(b_{h_{0}(u)}, \dots , b_{h_{k-1}(u)}\) in the Bloom filter are set to one, then \(u \in S\) holds with high probability. However, false positive classifications can occur when all the ones at the positions \(h_{0}(u), \dots , h_{k-1}(u)\) were caused by elements of S distinct from u. In this case the test indicates \(u \in S\) although this does not hold. Conversely, if at least one of the positions \(b_{h_{0}(u)}, \dots , b_{h_{k-1}(u)}\) equals zero, u is clearly not a member of S. This latter case is illustrated in Fig. 1.

Fig. 1. A Bloom filter storing the elements A and B from a set S. Two hash functions (illustrated by solid and dashed lines, respectively) are used. Set membership of another element C can be checked by hashing C through the same two hash functions. In this example, it is guaranteed that C is not a member of S, since one of the positions in the array to which C is hashed equals zero.
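To make this concrete, the following is a minimal Python sketch of a Bloom filter with insertion and membership test. The salted SHA-1 hash functions are illustrative stand-ins, not the keyed hash functions of any particular PPRL implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch with k illustrative hash functions."""

    def __init__(self, length, num_hashes):
        self.L = length
        self.k = num_hashes
        self.bits = [0] * length

    def _positions(self, item):
        # One position per hash function h_0, ..., h_{k-1}; here each h_i is
        # SHA-1 salted with the index i, reduced modulo L.
        return [int(hashlib.sha1(f"{i}|{item}".encode()).hexdigest(), 16) % self.L
                for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1  # once set to one, a bit never changes back

    def might_contain(self, item):
        # True means the item is in S with high probability (false positives
        # are possible); False means the item is definitely not in S.
        return all(self.bits[pos] == 1 for pos in self._positions(item))

bf = BloomFilter(length=1000, num_hashes=2)
bf.add("A")
bf.add("B")
print(bf.might_contain("A"), bf.might_contain("C"))  # True False (w.h.p.)
```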

2.1 PPRL with Bloom Filters Built from Multiple Identifiers

In [4] Bloom filters were used in privacy-preserving record linkage for the first time. This approach was expanded to Cryptographic Longterm Keys in [10].

In common PPRL protocols, two data owners A and B agree on a set of identifiers that occur in both of their databases. Next, these identifiers are standardized, then padded with blanks at the beginning and the end, and finally split into substrings of two characters. Each substring of the first identifier of a record is mapped to the first Bloom filter via several hash functions. Afterwards, each substring of the second identifier of the same record is mapped through another set of hash functions to the first Bloom filter as well. This procedure is repeated until all identifiers of the first record are stored in the first Bloom filter. Next, all identifiers of the second record of the database are mapped through the utilized hash functions to a second Bloom filter, and so on. Performing this procedure for all entries of the database results in a set of Bloom filters where each Bloom filter is built from multiple identifiers. The similarity (e.g., Jaccard similarity) of the resulting Bloom filters is a very good approximation of the (Jaccard) similarity of the unencoded bigram sets and thus of the unencoded identifier values. This latter fact is illustrated in Fig. 2 for the case of a single identifier.

Fig. 2. The similarity of Bloom filters yields a very good approximation to the similarity of unencoded bigram sets. The bigram sets \(A=\){␣M, ME, EI, IE, ER, R␣} and \(B=\){␣M, ME, EY, YE, ER, R␣} of the strings ␣MEIER␣ and ␣MEYER␣ have a Jaccard similarity of \(\frac{\vert A \cap B \vert }{\vert A \cup B \vert }=\frac{4}{8}=0.5\). The Jaccard similarity of binary vectors is defined as \(\frac{n_{11}}{n_{01}+n_{10}+n_{11}}\) where \(n_{ij}\) is the number of bit positions that are equal to i for the first and equal to j for the second vector. In our example, we get a value of \(\frac{8}{15}\approx 0.53\).
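As a check of this approximation, here is a small Python sketch that computes both similarities for the MEIER/MEYER example. The hash functions are illustrative, so the binary Jaccard value will be close to, but not necessarily exactly, the 8/15 obtained with the two hash functions of the figure.

```python
import hashlib

def bigrams(name):
    s = f" {name} "                         # pad with blanks: ␣MEIER␣
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom(bigram_set, L=1000, k=20):
    bits = [0] * L
    for b in bigram_set:
        for i in range(k):                  # k illustrative hash functions
            bits[int(hashlib.sha1(f"{i}|{b}".encode()).hexdigest(), 16) % L] = 1
    return bits

A, B = bigrams("MEIER"), bigrams("MEYER")
print(len(A & B) / len(A | B))              # Jaccard of bigram sets: 0.5

u, v = bloom(A), bloom(B)
n11 = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
n10 = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
n01 = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
print(n11 / (n01 + n10 + n11))              # binary Jaccard: close to 0.5
```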

Because of the specific structure of Bloom filters, record linkage based on Bloom filters built from multiple identifiers allows for errors in the encrypted data. Therefore, it can be applied to linking large data sets such as national medical databases [7].

2.2 Extant Research: Attacks on Bloom Filters of One or More Identifiers

To the best of our knowledge, only two ways of attacking Bloom filters of one identifier and one way of attacking Bloom filters of multiple identifiers are known so far.

The first cryptanalysis of Bloom filters was published in 2011. Kuzu et al. [13] sampled 20,000 records from a voter registration list and encrypted the substrings of two characters from the forenames through 15 hash functions and Bloom filters of length 500 bits. Their attack consisted of solving a constraint satisfaction problem (CSP). Through a frequency analysis of the forenames and the Bloom filters and by applying their CSP solver to the problem, Kuzu et al. were able to decipher approximately 11 % of the data.

In contrast, Niedermeyer et al. [14] proposed an attack on 10,000 Bloom filters built from encrypted German surnames that were considered to be a random sample of a known population. For the generation of the Bloom filters, 15 hash functions and a Bloom filter length of 1,000 were used. Niedermeyer et al. then conducted a manual attack based on the frequencies of the substrings of length two, which they derived from the German population. In this way they deciphered the 934 most frequent surnames out of 7,580 distinct ones, which corresponds to approximately 12 % of the data set. However, their attack is not limited to the most frequent names and could be extended to the decipherment of nearly all names.

In 2012, Kuzu et al. [15] proposed an attack on Bloom filters built from multiple identifiers. They applied their constraint solver to forename and surname, as well as forename, surname, city and ZIP code, of 50,000 randomly selected records from the North Carolina voter registration list. However, they were not able to mount a successful attack. Kuzu et al. therefore conjectured that combining multiple personal identifiers into a single Bloom filter offers a protection mechanism against frequency attacks. Although they suspected that their attack did not uncover all vulnerabilities of the Bloom filter encodings, they showed that the CSP for multiple identifiers is intractable for their constraint solver.

2.3 Our Contribution

In this paper we present a fully automated attack on a database containing forenames, surnames and places of birth. All records are considered to be a random sample of a known population. We suppose that the attacker only knows some publicly available lists of the most common forenames, surnames and locations. The attack is based on analyzing the frequencies and the joint occurrence of substrings of length two from the identifiers in these lists. Furthermore, we are interested in recovering as many identifiers as possible. Our cryptanalysis was implemented in the programming languages Python and C++.

3 Encryption

In this section some basic notation is introduced and the encryption procedure is described.

In record linkage scenarios, strings are usually standardized through transformations such as capitalization of characters or removal of diacritics [16].

After this preprocessing step all strings contain only tokens from some predefined alphabet \(\varSigma \). Throughout this article, we use the canonical alphabet \(\varSigma := \{ \texttt {A}, \texttt {B}, \ldots , \texttt {Z}, \texttt {\textvisiblespace } \}\), where \(\texttt {\textvisiblespace }\) denotes the padding blank. Thus, for example, the popular German surname Müller is transformed to ␣MUELLER␣ in the preprocessing step. As usual, we refer to substrings of two characters as bigrams and denote the set containing all bigrams by \(\varSigma ^2\), i.e.

$$\begin{aligned} \varSigma ^2=\{\texttt {\textvisiblespace \textvisiblespace }, \texttt {\textvisiblespace A}, \dots , \texttt {\textvisiblespace Z}, \texttt {A\textvisiblespace }, \dots , \texttt {Z\textvisiblespace }, \texttt {AA}, \dots , \texttt {ZZ}\}. \end{aligned}$$

The Bloom filter encryption of a record from a database is created by storing the bigram set associated with this record in a Bloom filter. The bigram set associated with a record is defined as the set containing the bigrams from all the identifiers. Here, a distinction between the bigrams occurring in different identifiers has to be made. Thus, if the set of identifiers is denoted with \(\mathcal {I}\), the bigram set of a record is a subset of \(\mathcal {I} \times \varSigma ^2\).

For example, if we have \(\mathcal {I}=\{\texttt {surname}, \texttt {forename}\}\) and the database contains a record, Peter Müller, the bigram set associated with this record would contain the bigrams \(\texttt {\textvisiblespace P}_f\), \(\texttt {PE}_{f}\), \(\texttt {ET}_f\), \(\texttt {TE}_f\), \(\texttt {ER}_f\), \(\texttt {R\textvisiblespace }_f\), \(\texttt {\textvisiblespace M}_s\), \(\texttt {MU}_s\), \(\texttt {UE}_s\), \(\texttt {EL}_s\), \(\texttt {LL}_s\), \(\texttt {LE}_s\), \(\texttt {ER}_s\) and \(\texttt {R\textvisiblespace }_s\) (the subscript f indicates the bigrams occurring in the forename identifier, the subscript s the ones occurring in the surname identifier).
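A minimal sketch of this construction follows; the transliteration table in the standardization step is a simplified assumption (the actual preprocessing rules, cf. [16], may be more extensive).

```python
def standardize(value):
    # Upper-case and transliterate German umlauts, e.g. "Müller" -> "MUELLER";
    # a simplified stand-in for the full preprocessing of [16].
    table = str.maketrans({"Ä": "AE", "Ö": "OE", "Ü": "UE", "ß": "SS"})
    return value.upper().translate(table)

def bigram_set(record):
    # record maps identifiers to values, e.g. {"forename": "Peter", ...};
    # the result is a subset of I x Sigma^2: bigrams tagged with identifiers.
    result = set()
    for identifier, value in record.items():
        s = f" {standardize(value)} "        # pad with blanks
        result |= {(identifier, s[i:i + 2]) for i in range(len(s) - 1)}
    return result

record = {"forename": "Peter", "surname": "Müller"}
for tagged in sorted(bigram_set(record)):
    print(tagged)   # e.g. ('forename', ' P'), ..., ('surname', 'UE'), ...
```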

Next, this bigram set is stored into a Bloom filter \((b_0,\ldots ,b_{L-1})\) of length L by means of k independent hash functions

$$\begin{aligned} h_i: \mathcal {I} \times \varSigma ^2 \rightarrow \{0,\ldots ,L-1\} \end{aligned}$$

for \(i=0,\ldots ,k-1\). In practice, one could alternatively use different hash functions \(h_i: \varSigma ^2 \rightarrow \{0,\ldots ,L-1\}\) for the distinct identifiers in order to guarantee that the hash values for distinct identifiers are not the same.

Further, as in [14] we introduce the term atom for the specific Bloom filters which occur as the fundamental building blocks of the encryption method.

Definition 1

(Atom). Let \(L, k \in \mathbb N\) and some hash functions \(h_{0},\ldots ,h_{k-1}\) be defined as above. Then, a Bloom filter

$$\begin{aligned} \mathcal {B} := (b_{0}, \dots , b_{L-1}) \in \{0,1\}^{L} \end{aligned}$$

is termed an atom if there exists a bigram \(\beta \in \mathcal {I} \times \varSigma ^2\) such that \(b_j=1 \Leftrightarrow h_i(\beta )=j\) for some \(i=0,\ldots ,k-1\). Such a Bloom filter is called the atom realized by the bigram \(\beta \) and denoted with \(\mathcal {B}(\beta )\).

Thus, atoms are special Bloom filters. Since each bigram is hashed via each \(h_i\) for \(i=0,\ldots ,k-1\), at most k positions in an atom can be set to one.

By combining the atoms of the underlying bigram set of a record with the bitwise OR operation, the Bloom filter of a record is composed as

$$\begin{aligned} \mathcal {B}(\texttt {record}) = \bigvee _{\beta \in \mathcal {S}_{\text {record}}} \mathcal {B}(\beta ), \end{aligned}$$

where \(\bigvee \) denotes the bitwise OR operator.

Note that the same bigram from \(\varSigma ^2\) is hashed differently if it occurs in distinct identifiers. This is illustrated in Fig. 3 for the example of the bigram \(\texttt {ER}\) which occurs in the record Peter Müller both in the surname and the forename identifier.

Fig. 3. Two different atoms of the bigram ER. These atoms are realized when instances of ER occur in distinct identifiers (the forename and surname identifiers in this example).

Mapping each bigram of the forename Peter with k hash functions results in six atoms; for the surname Müller, we get eight atoms. Thus, the separate Bloom filters for these identifiers might be composed as illustrated in Fig. 4.

Fig. 4. Bloom filters of the forename Peter and the surname Müller, composed of the atoms belonging to the underlying bigrams.

Fig. 5. The Bloom filter of the record Peter Müller is obtained by applying the bitwise OR operation to the Bloom filter encryptions of the separate identifiers.

The final Bloom filter for the record Peter Müller is composed by applying the bitwise OR operation to the separate Bloom filter encryptions of the distinct identifiers. This is demonstrated in Fig. 5.

In practice, the Bloom filter encryption of a record might contain a mixture of string valued identifiers (such as forename, surname or place of birth) and numerical identifiers, such as date of birth. However, in this paper we restrict ourselves to the case of string valued attributes, although the cryptanalysis proposed below is not limited to such attributes.

Assumptions

In many record linkage scenarios, it is supposed that a semi-trusted third party conducts the record linkage between two encrypted databases. In this paper we assume a data set containing Bloom filters built from multiple identifiers that is sent to such a semi-trusted third party. This third party acts as the adversary and tries to infer as much information as possible from the record encryptions. We further suppose that the attacker has knowledge of the encryption process.

For our experiment we generated 100,000 Bloom filters built from standardized German forenames, surnames and cities according to the distribution in the population. The identifiers were truncated after the tenth letter, padded with leading and trailing blanks, and split into bigrams. Then the bigrams were hashed through \(k=20\) hash functions into Bloom filters of length \(L=1,000\). As proposed in [4, 10], we used the so-called double hashing scheme for the generation of the k hash functions from two hash functions f and g. This double hashing scheme is defined via the equation

$$\begin{aligned} h_i = (f + i \cdot g) \mod L \quad \text { for } i=0,\ldots ,k-1 \end{aligned}$$
(1)

and was originally proposed in [17] as a simple hashing method for Bloom filters yielding satisfactory performance results.
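The following sketch illustrates the double hashing scheme of Eq. (1) and the composition of a record's Bloom filter. The HMAC-based functions f and g (and their keys) are illustrative stand-ins for the secret hash functions of an actual deployment.

```python
import hashlib
import hmac

L, K = 1000, 20

def _h(msg, key):
    # Illustrative keyed hash mapping a string to {0, ..., L-1}.
    return int(hmac.new(key, msg.encode(), hashlib.sha1).hexdigest(), 16) % L

def atom(tagged_bigram):
    # tagged_bigram is e.g. ("surname", "ER"); including the identifier tag
    # makes the same bigram hash differently in different identifiers.
    msg = "|".join(tagged_bigram)
    f, g = _h(msg, b"key-f"), _h(msg, b"key-g")
    bits = [0] * L
    for i in range(K):
        bits[(f + i * g) % L] = 1            # h_i = (f + i*g) mod L, Eq. (1)
    return bits

def record_bloom_filter(bigram_set):
    # Bitwise OR over the atoms of all tagged bigrams of the record.
    bits = [0] * L
    for beta in bigram_set:
        bits = [x | y for x, y in zip(bits, atom(beta))]
    return bits
```

Note that an atom produced this way has at most K ones, and fewer whenever the positions \(f + i \cdot g \bmod L\) collide.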

In our cryptanalysis we assume that the adversary knows that the hash values are generated in accordance with Eq. (1). Naturally, she must not have direct access to the hash functions f and g themselves, since such access would permit her to check directly whether a specific bigram is contained in a given Bloom filter.

Note that the double hashing scheme was also used for the generation of Bloom filters by Kuzu et al. [15]. However, knowledge of the double hashing scheme was not exploited in their cryptanalysis.

4 Cryptanalysis

This section provides a detailed description of the deciphering process. First, we try to detect the atoms that are contained in the given Bloom filters. Then, we assign bigrams to these atoms by means of an optimization algorithm. Finally, the original attribute values are reconstructed from the atoms.

Our approach for the development of a fully automated attack is based on previous results on the automated cryptanalysis of simple substitution ciphers presented by Jakobsen [18]. We give a short account of Jakobsen’s results in order to motivate our procedure.

4.1 Automated Cryptanalysis of Simple Substitution Ciphers

The encryption of a plaintext message through a simple substitution cipher is defined by a permutation of the underlying alphabet \(\varSigma \). For instance, the message HELLO␣LISBON with tokens from the alphabet \(\varSigma = \{ \texttt {\textvisiblespace }, \texttt {A}, \texttt {B}, \dots , \texttt {Z}\}\) could be encrypted as RVUUYJUOWAYL (note that the blank is encrypted as well). It is well known that this kind of encryption can be broken easily by means of a frequency analysis. However, just replacing the i-th most frequent character in the ciphertext with the i-th most frequent character in the underlying language will usually not lead to the correct decipherment (even for longer messages). This is commonly compensated for by taking bigram frequencies into consideration as well.

The expected bigram frequencies can be obtained from a training data set written in the underlying language and stored in a square matrix E (in the above example a \(27 \times 27\) matrix), where the entry \(e_{ij}\) is equal to the relative proportion of the bigram \(c_{i}c_{j}\) in the training text corpus and \(c_{i}\) denotes the i-th character of the alphabet. Analogously, the bigram frequencies of the ciphertext can be stored in a matrix D.

The algorithm proposed by Jakobsen [18] was intended to find a permutation \(\sigma _{\text {opt}}\) of the alphabet such that the objective function f defined via

$$\begin{aligned} f(\sigma ) := \sum _{i,j} |d_{\sigma (i)\sigma (j)} - e_{ij} |\end{aligned}$$
(2)

is minimized. The algorithm starts with the initial permutation that reflects the best assignment between single characters in the plaintext and the ciphertext with respect to their relative frequency. In each step of the algorithm two elements of the currently best permutation \(\sigma _{\text {opt}}\) are swapped, leading to a new candidate permutation \(\sigma \). If \(f(\sigma ) < f(\sigma _{\text {opt}})\) holds, the current permutation is updated to \(\sigma \), otherwise \(\sigma \) is discarded and a new candidate \(\sigma \) is generated by swapping two other elements of \(\sigma _{\text {opt}}\). This is repeated until no swap leads to a further improvement of the objective function f. Throughout this paper we use the same strategy as Jakobsen in [18] in order to determine the elements of the current permutation to be swapped. For a more detailed description of Jakobsen’s method in the case of simple substitution ciphers we refer the reader to his original paper [18]. Figure 2 in [18] shows that a ciphertext of length 600 built by a simple substitution cipher can be entirely broken by this method.
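A minimal sketch of this hill climbing follows, assuming D and E are square NumPy arrays with rows and columns ordered by decreasing frequency (so that the identity permutation is the frequency-based initial assignment). For simplicity, the objective is recomputed from scratch for every candidate swap; Jakobsen's original algorithm updates it incrementally, which is considerably faster.

```python
import numpy as np

def objective(sigma, D, E):
    # f(sigma) = sum_{i,j} |d_{sigma(i) sigma(j)} - e_{ij}|, cf. Eq. (2).
    return np.abs(D[np.ix_(sigma, sigma)] - E).sum()

def hill_climb(D, E):
    n = len(E)
    sigma = np.arange(n)                     # frequency-based initial guess
    best = objective(sigma, D, E)
    improved = True
    while improved:
        improved = False
        for dist in range(1, n):             # swap elements at distance 1, 2, ...
            for i in range(n - dist):
                cand = sigma.copy()
                cand[i], cand[i + dist] = cand[i + dist], cand[i]
                val = objective(cand, D, E)
                if val < best:               # keep the swap only if f decreases
                    sigma, best = cand, val
                    improved = True
    return sigma, best
```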

It is clear that some modification of Jakobsen's original algorithm is necessary in order to make it applicable in our setting as well. In particular, the definitions of the matrices D and E must be changed. The adapted definitions are introduced in Subsect. 4.3.

4.2 Atom Detection

As in [14], the basic principle of our approach consists of detecting atoms, which represent the encryption of one single bigram only. Since the Bloom filter of a string is created by the superposition of at least a few atoms, the reconstruction of the atoms given only a set of Bloom filters turns out to be difficult. Note that this task cannot be solved in a satisfactory manner if Bloom filters are considered in isolation or in small groups, because in this case too many binary vectors will be wrongly classified as atoms.

Let us briefly motivate our novel method for atom detection. If the bitwise AND operation is applied to a set of Bloom filters that have one bigram \(\beta \) in common, at least all positions set to one by \(\beta \) are equal to one in the result. However, for prevalent bigrams it should be expected that all the other positions are set to zero if a sufficient number of Bloom filters is considered, i.e., the result would be exactly the atom induced by the bigram \(\beta \).

Of course, if an adversary has access to a set of Bloom filters, she does not know a priori which Bloom filters have a bigram in common. This obstacle can be overcome as follows: Under the assumption that the double hashing scheme is being used, the adversary can compute, for each candidate pair of values of f and g, the k bit positions given by Eq. (1) and determine the set of Bloom filters in which all these positions are set to one. Then, the bitwise AND operation is applied to this set of Bloom filters. If the result has ones exactly at the candidate positions, it is considered by the adversary to be the realization of a bigram.
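The following sketch illustrates this detection step under the double hashing assumption; the function names and the brute-force loop over all \((f,g)\) pairs are illustrative.

```python
L, K = 1000, 20

def candidate_positions(f, g):
    # Positions an atom with hash values f and g would set, cf. Eq. (1).
    return {(f + i * g) % L for i in range(K)}

def detect_atom(bloom_filters, f, g):
    pos = candidate_positions(f, g)
    # All Bloom filters having ones at every candidate position ...
    matching = [bf for bf in bloom_filters if all(bf[p] == 1 for p in pos)]
    if not matching:
        return None
    # ... are combined with the bitwise AND operation.
    acc = matching[0]
    for bf in matching[1:]:
        acc = [x & y for x, y in zip(acc, bf)]
    # Accept the result as an atom only if every bit outside the candidate
    # positions has been cancelled to zero.
    if all((acc[j] == 1) == (j in pos) for j in range(L)):
        return tuple(acc)
    return None

# Brute force over all L^2 combinations (easily parallelized):
# atoms = {a for f in range(L) for g in range(L)
#          if (a := detect_atom(bloom_filters, f, g)) is not None}
```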

The resulting set of atoms was further reduced by discarding atoms of Hamming weight \(\sum _{i=0}^{999} b_i\) equal to 1, 2, 4 or 5 and keeping only atoms of Hamming weight equal to 8, 10 or 20; otherwise, too many binary vectors would have been classified incorrectly as atoms. (Under the double hashing scheme, the number of distinct positions \(f + i \cdot g \bmod L\) is \(\min (k, L/\gcd (g,L))\), so for \(k=20\) and \(L=1,000\) only the Hamming weights 1, 2, 4, 5, 8, 10 and 20 can occur.) The probability that an atom has Hamming weight less than 8 in our setting is equal to 0.008. This value can be derived in analogy to Lemma A.1 and the subsequent example in [14].

We denote the number of atoms found by n. For our specific data set we got \(n=\) 1,776. This result seems reasonable because the total number of possible atoms is bounded from above by 2,187 and obviously not all of these atoms, in particular atoms realized by rare bigrams, occur in our simulated data. As we checked later on, 1,337 of the 1,776 extracted conjectured atoms were indeed true atoms, that is to say atoms generated by one of the 2,187 bigrams. The subsequent analysis demonstrates that this percentage of correct atom detection is sufficient for a successful cryptanalysis. For each atom \(\alpha \) we determined the set of Bloom filters containing this atom, i.e. Bloom filters for which all bit positions of the atom are set to 1. We denote the atoms with \(\alpha _{1}, \dots , \alpha _{1776}\) according to decreasing frequency. As an illustrative example, in Bloom filter No. 850 (which looks like \(1011011111\ldots 1110110010\)) the atoms \(\alpha _{5}\), \(\alpha _{8}\), \(\alpha _{14}\), \(\alpha _{15}\), \(\alpha _{29}\), \(\alpha _{33}\), \(\alpha _{36}\), \(\alpha _{46}\), \(\alpha _{55}\), \(\alpha _{106}\), \(\alpha _{110}\), \(\alpha _{123}\), \(\alpha _{138}\), \(\alpha _{169}\), \(\alpha _{194}\), \(\alpha _{197}\), \(\alpha _{218}\), \(\alpha _{254}\), \(\alpha _{309}\), \(\alpha _{313}\), \(\alpha _{317}\), \(\alpha _{334}\), \(\alpha _{335}\), \(\alpha _{396}\), \(\alpha _{398}\), \(\alpha _{453}\), \(\alpha _{607}\), \(\alpha _{668}\), \(\alpha _{705}\), \(\alpha _{782}\), \(\alpha _{821}\), \(\alpha _{960}\) and \(\alpha _{1131}\) were detected.

In the following subsection we explain how correlations between the occurrences of atoms in the Bloom filters and of bigrams in a training data set can be used to give adequate definitions of the matrices D and E that serve as the input of Jakobsen's algorithm.

4.3 Correlation of Atoms and Bigrams

A naive assignment of bigrams to atoms is possible only for a few frequent bigrams at most. For example, if German surnames, forenames and birth locations are considered together, usually the most frequent bigram is \(\texttt {A\textvisiblespace }_f\) (the bigram A␣ in the forename identifier), so the most frequent atom is likely to be the encryption of this bigram. The absolute frequencies of the 10 most frequent bigrams in the considered training data are illustrated in Fig. 6.

Fig. 6. Absolute frequencies of the 10 most frequent bigrams in our training data set.

Except for the first few bigrams, the bigram frequencies lie too close together for naive matching to be promising for automatic decipherment.

In the example of Bloom filter No. 850 already introduced above, this naive assignment would lead to the conjecture that the corresponding record contains the following bigrams: \(\texttt {N\textvisiblespace }_l\), \(\texttt {R\textvisiblespace }_s\), \(\texttt {CH}_s\), \(\texttt {N\textvisiblespace }_f\), \(\texttt {HE}_l\), \(\texttt {\textvisiblespace \textvisiblespace }_l\), \(\texttt {SC}_s\), \(\texttt {S\textvisiblespace }_f\), \(\texttt {E\textvisiblespace }_l\), \(\texttt {\textvisiblespace L}_f\), \(\texttt {BE}_s\), \(\texttt {NI}_f\), \(\texttt {AR}_s\), \(\texttt {\textvisiblespace W}_f\), \(\texttt {\textvisiblespace P}_f\), \(\texttt {NG}_s\), \(\texttt {IR}_f\), \(\texttt {ET}_s\), \(\texttt {MI}_s\), \(\texttt {NI}_s\), \(\texttt {VE}_l\), \(\texttt {OS}_l\), \(\texttt {NS}_s\), \(\texttt {UN}_s\), \(\texttt {AT}_s\), \(\texttt {\textvisiblespace V}_s\), \(\texttt {LH}_l\), \(\texttt {OW}_l\), \(\texttt {AA}_s\), \(\texttt {ZB}_l\), \(\texttt {RR}_l\), \(\texttt {DY}_f\) and \(\texttt {MR}_s\). However, from this list of bigrams it is obviously impossible to reconstruct any meaningful information.

For this reason, we also took correlations between bigrams into account. For example, for records sampled from the population of Germany the appearance of the bigram \(\texttt {CH}_s\) in a record makes the appearance of the bigram \(\texttt {SC}_s\) in the same record more likely because the trigram SCH frequently appears in German surnames.

We model this kind of information on the correlation of atoms and bigrams by means of two matrices D and E. Assume that the attribute values of the records, built from tokens of the alphabet \(\varSigma = \{\texttt {\textvisiblespace }, \texttt {A}, \texttt {B}, \dots , \texttt {Z}\}\), are to be encrypted. Thus, for each (string valued) identifier we have 729 possible bigrams. Since the same bigram is encrypted differently for each identifier, we have to distinguish between different instances of the same bigram. In our setting we denote the bigram \(\beta \) for the surname, forename and location identifier with \(\beta _s\), \(\beta _f\) and \(\beta _l\), respectively. Altogether, the set \(\mathcal {I} \times \varSigma ^{2}\) containing all possible bigram instances consists of \(3 \cdot 729=\) 2,187 elements.

Let us now introduce the matrix E containing information about the expected bigram correlations obtained from the training data set. Note that the training data should be as similar to the encrypted data as possible, e.g. a random sample from the same underlying population as the encrypted data. If the prevailing Bloom filters are known to contain encryptions of records from the German population, an attacker would try to get access to a comparable database containing the same identifiers. Thus, the choice of the reference dataset should depend on the context. For example, we would expect our cryptanalysis to be less successful when the Bloom filters mainly encrypt German names whereas the training data consists of a random sample from the French population. The attribute values of this training data set are preprocessed analogously to the preprocessing routine before the encryption process. Then, the bigram sets for all the attribute values are created. We denote the bigrams with \(\beta _{1}, \dots , \beta _{2187}\) according to decreasing frequency. Let T be the total number of records in the training data set and \(t_{ij}\) the number of records that contain both bigram \(\beta _{i}\) and bigram \(\beta _{j}\). Then the matrix \(E = (e_{ij})_{i,j = 1,\dots , 2187}\) is defined via

$$\begin{aligned} e_{ij} = {\left\{ \begin{array}{ll} t_{ij}/T &{}\text{ if } i\ne j, \\ 0 &{} \text{ if } i=j. \end{array}\right. } \end{aligned}$$

The matrix D is formed in a similar way on the basis of joint appearances of atoms in the Bloom filters. Let N be the number of Bloom filters for which atoms have been extracted. We denote the number of Bloom filters that contain both atom \(\alpha _{i}\) and atom \(\alpha _{j}\) by \(b_{ij}\). The matrix \(D = (d_{ij})_{i,j = 1,\dots , 2187}\) is defined through

$$\begin{aligned} d_{ij} = {\left\{ \begin{array}{ll} b_{ij}/N &{}\text{ if } i\ne j \text { and } i,j \le 1776, \\ 0 &{} \text{ if } i=j \text { or } \max (i,j)>1776. \end{array}\right. } \end{aligned}$$
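A minimal sketch of how both matrices can be assembled from co-occurrence counts, assuming the bigrams (for E) and the detected atoms (for D) are already sorted by decreasing frequency; D is then zero-padded to the full 2,187 × 2,187 shape as defined above.

```python
import numpy as np

def cooccurrence_matrix(item_sets, ordered_items, total):
    # ordered_items: bigrams (for E) or atoms (for D), by decreasing
    # frequency; item_sets: one set per training record / Bloom filter;
    # total: T (training records) or N (Bloom filters), respectively.
    index = {item: i for i, item in enumerate(ordered_items)}
    n = len(ordered_items)
    M = np.zeros((n, n))
    for s in item_sets:
        present = [index[x] for x in s if x in index]
        for i in present:
            for j in present:
                if i != j:                  # diagonal stays zero
                    M[i, j] += 1
    return M / total

def zero_pad(M, size=2187):
    # Pad D with zero rows/columns for indices beyond the detected atoms.
    P = np.zeros((size, size))
    P[:M.shape[0], :M.shape[1]] = M
    return P
```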

The procedure suggested by Jakobsen, as described above, can now be applied directly to the matrices D and E. The pseudocode for this can be found in Algorithm 1.

The progress of the optimization algorithm is illustrated in Fig. 7.

Fig. 7. Progress of the optimization algorithm for our data set. The initial value of the objective function is 370.99 and 2,812 updating steps were performed. The final value of the objective function \(f(\sigma _{\text {opt}})\) was equal to 168.5.

The result of the algorithm will be the final assignment between atoms and bigrams defined by a permutation \(\sigma _{\text {opt}} \in S_{2187}\) and the assignment rule \(\alpha _{\sigma _{\text {opt}}(i)} \rightarrow \beta _i\). This assignment is used to reconstruct the original bigram sets encrypted in the Bloom filters.

Algorithm 1. Pseudocode of the swap-based optimization applied to the matrices D and E.

For example, the bigrams \(\texttt {ER}_s\), \(\texttt {R\textvisiblespace }_s\), \(\texttt {CH}_s\), \(\texttt {N\textvisiblespace }_f\), \(\texttt {HE}_l\), \(\texttt {\textvisiblespace \textvisiblespace }_l\), \(\texttt {SC}_s\), \(\texttt {\textvisiblespace S}_f\), \(\texttt {E\textvisiblespace }_l\), \(\texttt {HE}_s\), \(\texttt {\textvisiblespace K}_l\), \(\texttt {RL}_l\), \(\texttt {AR}_l\), \(\texttt {Z\textvisiblespace }_s\), \(\texttt {ON}_f\), \(\texttt {SI}_f\), \(\texttt {\textvisiblespace F}_s\), \(\texttt {IS}_s\), \(\texttt {LS}_l\), \(\texttt {HW}_l\), \(\texttt {SO}_f\), \(\texttt {RU}_l\), \(\texttt {UR}_s\), \(\texttt {IM}_f\), \(\texttt {KA}_l\), \(\texttt {MO}_f\), \(\texttt {AV}_f\), \(\texttt {FI}_s\), \(\texttt {UH}_l\), \(\texttt {HH}_l\), \(\texttt {SR}_l\), \(\texttt {UZ}_l\) and \(\texttt {MR}_s\) were assigned to the Bloom filter No. 850.

In the following section we describe how attribute values were reassembled from the reconstructed bigram sets.

4.4 Reconstruction of Attribute Values

In order to reconstruct the original attribute values of the records, we separated the bigrams belonging to different identifiers for each Bloom filter.

In the example of Bloom filter No. 850, we obtained the bigrams \(\texttt {N\textvisiblespace }_f\), \(\texttt {\textvisiblespace S}_f\), \(\texttt {ON}_f\), \(\texttt {SI}_f\), \(\texttt {SO}_f\), \(\texttt {IM}_f\), \(\texttt {MO}_f\), \(\texttt {AV}_f\) for the forename identifier, the bigrams \(\texttt {ER}_s\), \(\texttt {R\textvisiblespace }_s\), \(\texttt {CH}_s\), \(\texttt {SC}_s\), \(\texttt {HE}_s\), \(\texttt {Z\textvisiblespace }_s\), \(\texttt {\textvisiblespace F}_s\), \(\texttt {IS}_s\), \(\texttt {UR}_s\), \(\texttt {FI}_s\), \(\texttt {MR}_s\) for the surname identifier and finally the bigrams \(\texttt {HE}_l\), \(\texttt {\textvisiblespace \textvisiblespace }_l\), \(\texttt {E\textvisiblespace }_l\), \(\texttt {\textvisiblespace K}_l\), \(\texttt {RL}_l\), \(\texttt {AR}_l\), \(\texttt {LS}_l\), \(\texttt {HW}_l\), \(\texttt {RU}_l\), \(\texttt {KA}_l\), \(\texttt {UH}_l\), \(\texttt {HH}_l\), \(\texttt {SR}_l\), \(\texttt {UZ}_l\) for the location identifier. From this list it is already possible to guess the original identifier values at first glance.

Our fully automated approach to reconstructing the original identifier values was to compare the obtained bigram sets with a list of bigram sets generated from reference lists of forenames, surnames and locations. For Bloom filter No. 850, for example, an adversary would correctly infer that this Bloom filter encrypts a record belonging to the person Simon Fischer from the German city Karlsruhe.
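The text leaves the concrete matching procedure open; one plausible realization, sketched below, scores each reference value by the Jaccard similarity between its bigram set and the recovered bigram set (the reference list shown is illustrative).

```python
def bigrams(value):
    s = f" {value} "
    return {s[i:i + 2] for i in range(len(s) - 1)}

def best_match(recovered, reference_values):
    # recovered: bigram set reconstructed for one identifier of a record.
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(reference_values, key=lambda v: jaccard(recovered, bigrams(v)))

forenames = ["SIMON", "PETER", "THOMAS"]       # illustrative reference list
recovered = {" S", "SI", "IM", "MO", "ON", "N "}
print(best_match(recovered, forenames))        # -> SIMON
```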

4.5 Results

Using the approach described above, we were able to correctly reconstruct 59.6 % of the forenames, 73.9 % of the surnames and 99.7 % of the locations. For 44 % of the 100,000 records, all identifier values were recovered successfully.

5 Conclusion

In this paper we demonstrated a successful fully automated attack on Bloom filters built from multiple identifiers. We were able to recover approximately 77.7 % of the original identifier values. Contrary to the assumption in [14, 15] that storing all identifiers in a single Bloom filter makes an attack more difficult, we needed only moderate computational effort and publicly available lists of forenames, surnames and locations to reconstruct the identifiers. Note that the size of the database containing the Bloom filters has no major impact. For our cryptanalysis it is sufficient to perform the attack on a subset of the given Bloom filters (100,000 as in our example should be adequate in most cases). For the remaining Bloom filters it then suffices to check which atoms they contain and to reconstruct the attribute values, since most assignments of atoms to bigrams are already known. Thus, the time needed for cryptanalysis is linear in the number of input Bloom filters. The time needed for the detection of atoms is \(O(L^2)\), since there are L possible values each for the hash functions f and g in Eq. (1). Furthermore, the detection of atoms could easily be parallelized to speed up the computation. In addition, values of L significantly larger than the \(L=1,000\) considered in this paper would slow down the linkage between two databases (note that in the large-scale study reported in [7] a Bloom filter length of only 100 was used). Thus, the most time-consuming step in our cryptanalysis should be the optimization algorithm presented in Subsect. 4.3. Indeed, in the chosen parameter setup this procedure took about 402 min on a notebook with a 2.80 GHz Intel\(^{\textregistered }\) Core processor running Ubuntu 14.04 LTS.

To sum up, we do not recommend the use of Bloom filters built from one or more identifiers and generated with the double hashing scheme in applications where high security standards are required. However, our attack addressed a very special scenario, because the generated databases were encrypted using the double hashing scheme. In the case of arbitrary hash functions, i.e. without the restriction that they are generated from two hash functions in accordance with Eq. (1), the detection of atoms becomes much harder, since an exhaustive iteration over all candidate atoms is no longer feasible: the number of candidate atoms increases from less than \(L^2=10^6\) to \(\sum _{j=1}^{20} \left( {\begin{array}{c}1000\\ j\end{array}}\right) \ge 3\cdot 10^{41}\). However, we think that using independent hash functions alone will not be sufficient to ensure security, since in this case other approaches (perhaps related to, or at least inspired by, work from the area of Frequent Itemset Mining [19]) are promising for detecting at least the most frequent atoms automatically. The development of such a more general method for atom detection will be part of future work.

Niedermeyer et al. [14] proposed several methods, such as fake injections, salting or randomly selected hash values, to harden Bloom filters. We are therefore confident that hardening methods like these show promise in preventing attacks like the one presented in this paper and might make Bloom filters suitable for PPRL of sensitive personal data. Further investigations in this direction will also be part of future work.