Elsevier

Data & Knowledge Engineering

Volume 72, February 2012, Pages 148-171

Knowledge hiding from tree and graph databases

https://doi.org/10.1016/j.datak.2011.10.002

Abstract

Sensitive knowledge hiding is the problem of removing sensitive knowledge from databases before publishing. The problem has been studied extensively in the context of relational databases, where the goal is to hide frequent itemsets and association rules. Recently, sequential pattern hiding from sequential (both sequence and spatio-temporal) databases has been investigated [1]. With ever-increasing and increasingly versatile application demands, new forms of knowledge and databases should be addressed as well. In this work, we address the knowledge hiding problem in the context of tree and graph databases. For these databases, efficient frequent pattern mining algorithms have already been developed in the literature. Since some of the discovered patterns may be deemed sensitive, we develop appropriate sanitization techniques to protect the privacy of the sensitive patterns.

Introduction

Privacy preserving data mining has been an active research area since the work of O'Leary [2], which showed that data mining is indeed a threat to database security. This is because recent advances in data mining have resulted in highly sophisticated tools capable of mining any kind of knowledge, including private and sensitive knowledge. Hence, database publishing must be handled carefully: to prevent unintended disclosure, a sanitization procedure should be applied before publication. In other words, the privacy of sensitive patterns should be protected. As a first step, the sensitive knowledge must be identified and provided as an input to the sanitization procedure. This is a subjective step which depends on the data publisher's privacy policy. Since the databases to be published and the sensitive patterns are diverse and may vary across institutions and applications, accompanying sanitization algorithms are needed to facilitate database publishing. Sanitization algorithms are usually classified by the form of the database and of the sensitive knowledge to which they can be applied. For instance, when the database is relational and the sensitive knowledge takes the form of frequent itemsets, the respective knowledge hiding algorithms are classified as frequent itemset hiding [3]. Similarly, we speak of association rule hiding algorithms when the database is relational and the sensitive knowledge takes the form of association rules [4]. Recently, another class of sanitization algorithms has been developed to hide sequential (including spatio-temporal) patterns from sequence (including spatio-temporal trajectory) databases [5], [6], [1]. These problems are well established, and the literature currently focuses on developing more efficient and effective algorithms.

In this work, we study the knowledge hiding problem in the context of tree and graph structured databases. Our motivation stems from the fact that new applications demand more structured databases and new forms of knowledge. For example, many applications today use XML (tree structured) databases and also share data in XML. Another example is the recent growth of social network applications (e.g. Facebook, DBpedia, LinkedIn), which utilize graph structured data. Moreover, data mining algorithms are already available for both tree and graph structured databases.

Consider a scenario in which a Web retail company organizes its products into a hierarchy and collects user accesses in an XML database. Suppose also that the company decides to publish the database for do-it-yourself analysis. The company is aware that the receivers (including its competitors) can run frequent tree mining tools to extract frequent tree patterns, some of which may be sensitive due to their potential commercial value. Knowing this, the company would prefer to publish a distorted version of the database in which the sensitive patterns are hidden while the others remain discoverable. Similarly, consider another company running a social network application which collects user friendship data together with movie rating/tagging data in a graph database. Clearly, this database may contain sensitive frequent graph patterns that the company would never like to share with others. Note that a pattern may be called sensitive due to accompanying privacy, secrecy or commercial constraints. As in the tree hiding case, we need appropriate graph hiding tools to facilitate graph database publishing while preventing the disclosure of sensitive patterns.

In this paper, we introduce the knowledge hiding problem for tree and graph database publishing. We present respective knowledge representations, problem formulations, sanitization techniques and experimental evaluations. Our contributions include the following.

  • Introduction of the knowledge hiding problem for tree and graph databases. To the best of our knowledge, this is the first work addressing the issue.

  • Identification of relevant tree and graph patterns to be hidden. To this end, we consider four subtree containment classes, namely induced-ordered, induced-unordered, embedded-ordered, and embedded-unordered, and two subgraph containment classes, namely induced subgraphs and embedded subgraphs. Each containment class has corresponding data mining algorithms, so our hiding algorithms are all relevant in this sense.

  • Definition of the knowledge hiding problem for each of the containment classes. We also show the theoretical hardness of the problems.

  • Development of efficient and effective sanitization techniques. We develop heuristics to do so.

  • Extensive experimentation. We experiment with three tree and three graph databases to assess the utility of our proposals.

The paper is structured as follows. In the rest of this section, we first situate our work within the knowledge hiding and tree/graph mining/matching literature. In Section 2, we briefly review the frequent itemset hiding and sequential pattern hiding problems for completeness. The tree and graph databases and patterns are formalized in Section 3. We introduce the tree (resp., graph) hiding problems in Section 4 (resp., 5). These sections also present our sanitization techniques and heuristics. Section 6 presents an experimental evaluation and Section 7 concludes.

Knowledge hiding is a special case of statistical disclosure control in databases [7], [8] and a subfield of privacy-preserving data publishing [9], [10]. The work by Atallah et al. [3] was the first to study knowledge hiding from relational databases. They posed the problem of limiting the disclosure of sensitive rules (frequent itemsets or association rules) by reducing their support and confidence. While reducing the significance of sensitive rules, a second objective is to leave the significance of non-sensitive rules unaltered or minimally affected. This problem (optimal sanitization) is shown to be NP-hard. Thus, they proposed a heuristic which performs a greedy search for each sensitive itemset. Their method starts from the sensitive itemset and selectively removes items until a singleton is left. Then, they examine the list of transactions that support both the selected item and the initial sensitive itemset to identify the transaction that affects the minimum number of 2-itemsets, and remove the selected item from this transaction.

In [4], the authors propose three strategies which aim at either hiding the frequent sets that participate in given sensitive association rules, or reducing the rules' significance below the minimum confidence threshold. The confidence of a rule is decreased either by increasing the support of the rule's antecedent through transactions that partially support it, or by decreasing the support of the rule's consequent in transactions supporting both the antecedent and the consequent. A limitation of the approach is the assumption that sensitive itemsets must be disjoint. An extension is presented in [11].
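The two confidence-reducing strategies can be illustrated with a minimal Python sketch. It assumes the standard definitions, where the support of an itemset is the number of transactions containing it and conf(A ⇒ B) = support(A ∪ B) / support(A). The toy database, the item names and the helpers `support` and `confidence` are hypothetical and are not the algorithms of [4].

```python
from fractions import Fraction

def support(db, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def confidence(db, antecedent, consequent):
    """conf(A => B) = support(A u B) / support(A), kept as an exact fraction."""
    return Fraction(support(db, antecedent | consequent), support(db, antecedent))

# Hypothetical toy database; each transaction is a set of items.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a"}, {"b", "c"}]
A, B = {"a", "b"}, {"c"}          # sensitive rule {a, b} => {c}
before = confidence(db, A, B)     # 2/3

# Strategy 1: raise support(A) by completing the antecedent in a transaction
# that only partially supports it and does not contain the consequent.
s1 = [set(t) for t in db]
s1[3].add("b")                    # {a} becomes {a, b}
after_s1 = confidence(s1, A, B)   # 1/2

# Strategy 2: lower support(A u B) by removing the consequent from a
# transaction that supports both the antecedent and the consequent.
s2 = [set(t) for t in db]
s2[0].discard("c")                # {a, b, c} becomes {a, b}
after_s2 = confidence(s2, A, B)   # 1/3
```

In this toy setting both modifications push the confidence of the rule down without touching the other transactions; a real sanitizer would additionally choose which transaction to modify so that non-sensitive rules are affected as little as possible.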

The notion of “unknowns” is introduced in [12], [13]. The goal is to obscure a given set of sensitive rules from being identified, by replacing known values in transactions with unknowns, and then appropriately adjusting the values of these unknowns to minimally affect the non-sensitive rules. An efficient, scalable and one-scan heuristic algorithm, called Sliding Window Algorithm (SWA), is introduced in [14].

The work in [15] proposes two distortion-based heuristic techniques for selectively hiding sensitive rules. The hiding process may introduce a number of side effects, either by generating rules which did not previously exist, or by eliminating existing non-sensitive rules. A technique for hiding “maximal” sensitive patterns using a correlation matrix is introduced in [16]. Instead of selecting individual transactions and sanitizing them, the authors propose a methodology for directly constructing a sanitization matrix M by observing the relationship that holds between sensitive patterns and non-sensitive ones. This matrix is then multiplied by the original database D, yielding a new sanitized database D′ which addresses the privacy disclosure concerns. However, although D′ limits disclosure, it does not guarantee that all sensitive itemsets are hidden at the specified disclosure threshold. Hence, the technique is not free of hiding failures.

Sun et al. [17] introduced the use of borders to track the impact of altering transactions w.r.t. the number of lost frequent itemsets. To do so, they work on the itemset lattice to compute the positive and the negative borders. The proposed methodology focuses on preserving the shape of the border, which directly reflects the quality of the sanitized database that is produced. Another border-based approach is proposed in [18]. It uses an integer programming optimization algorithm for identifying the minimum number of transactions that need to be sanitized.
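As a small illustration of the two borders, the following Python sketch enumerates the itemset lattice of a toy database by brute force. It assumes the usual definitions: the positive border consists of the frequent itemsets that have no frequent proper superset, and the negative border consists of the infrequent itemsets all of whose proper subsets are frequent. The function names and the data are hypothetical, and the sketch is not the procedure of [17] or [18].

```python
from itertools import combinations

def support(db, itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def borders(db, min_sup):
    """Return (positive border, negative border) by exhaustive enumeration.

    Exponential in the number of items, so only suitable for toy data.
    """
    items = sorted(set().union(*db))
    all_sets = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)]
    frequent = {s for s in all_sets if support(db, s) >= min_sup}
    # Positive border: frequent itemsets with no frequent proper superset.
    positive = {s for s in frequent if not any(s < f for f in frequent)}
    # Negative border: infrequent itemsets whose immediate subsets are all
    # frequent (singletons qualify as soon as they are infrequent).
    negative = {s for s in all_sets
                if s not in frequent
                and all(s - {i} in frequent for i in s if len(s) > 1)}
    return positive, negative

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
positive, negative = borders(db, min_sup=2)
# positive == {{a, b}, {a, c}}, negative == {{b, c}}
```

A sanitization step that changes `db` can then be judged by how far it moves these two sets, which is the intuition behind preserving the shape of the border.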

The works by Abul et al. [5], [6], [1] extend the classical knowledge hiding problem to sequential patterns. In their problem setting, both the transactions and the patterns are sequential. Both one-dimensional sequences and two-dimensional (spatio-temporal trajectory) sequences are supported. Moreover, their methods are able to perform sanitization in the presence of the three typical constraints of sequence mining (minimum gap, maximum gap and maximum window).

While all these different approaches for knowledge hiding focus mainly on frequent itemsets, association rules, and sequential (spatio-temporal) pattern mining, to the best of our knowledge, our work is the first addressing the knowledge hiding problem for tree and graph databases.

Given a tree database, frequent subtree mining is the problem of finding subtrees that occur in a significant portion of the trees in the database (a.k.a. the forest). The algorithms usually start with smaller trees and prune a candidate whenever its support is below the threshold; otherwise, the frequent subtrees are joined to produce larger candidate subtrees for support counting. The efficiency of these algorithms depends directly on generating no candidate more than once and on fast support counting of candidates. An efficient algorithm for frequent induced and embedded subtree mining is presented in [19]. The algorithm introduced in [20], [21] is another efficient algorithm for frequent subtree mining; it can additionally mine graph structured data.

The frequent subgraph mining problem is defined analogously to the frequent subtree mining problem. Given a graph database, the problem requires finding all subgraphs which are contained in a significant number of the graphs in the database. The gSpan algorithm developed by Yan and Han [22] finds all frequent subgraphs using a lexicographic ordering technique. Another efficient algorithm is presented in [23].

The tree/graph matching problem, on the other hand, is about finding instances of a pattern (or query) tree/graph in a (data) tree/graph database. Sometimes the interest is in finding all instances, and sometimes only in checking existence. In general, the tree matching problem is easier than the graph matching problem, as the subgraph isomorphism problem is NP-complete. However, there are different formulations of the matching problem, including (i) induced subtrees/subgraphs, (ii) embedded subtrees/subgraphs, (iii) tree/graph isomorphism and others. Fortunately, almost all variants, especially subtree matching due to XML querying, are extensively studied in the literature [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35].

Section snippets

Frequent itemset and sequential pattern hiding

In this section, we briefly review the formal frequent itemset mining, association rule mining and sequential pattern mining problems and their respective sensitive knowledge hiding problems.

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items (symbols). A transaction $T$ is any non-empty subset of $I$, i.e., $T \in 2^I$, and a database $D$ is a collection of transactions, $D = \{t_1, t_2, \ldots, t_m\}$, where each $t_i$ is a transaction. An itemset $X$ is called a $k$-itemset if $|X| = k$.
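The notions above map directly onto Python sets. The sketch below is only illustrative: it assumes the standard notion of support (a transaction supports an itemset when it contains all of its items), and the helper names, the toy data and the naive hiding loop are hypothetical rather than taken from the paper.

```python
def is_transaction(t, items):
    """A transaction is any non-empty subset of the item universe I."""
    return len(t) > 0 and t <= items

def support_set(db, x):
    """Indices of the transactions that contain itemset x (standard notion)."""
    return [i for i, t in enumerate(db) if x <= t]

def hide_itemset(db, sensitive, threshold):
    """Naive sanitization sketch: delete one item of `sensitive` from
    supporting transactions until its support drops below `threshold`.

    Assumes |sensitive| >= 2, so that no transaction becomes empty.
    """
    out = [set(t) for t in db]               # work on a copy
    victim = min(sensitive)                  # arbitrary but deterministic
    for i in support_set(out, sensitive):
        if len(support_set(out, sensitive)) < threshold:
            break
        out[i].discard(victim)
    return out

I = {"a", "b", "c", "d"}
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
X = {"a", "b"}                               # a 2-itemset, since |X| = 2

sanitized = hide_itemset(D, X, threshold=2)
# Support of X is 3 in D and 1 in the sanitized copy.
```

The loop makes no attempt to limit side effects on non-sensitive itemsets; choosing which item and which transactions to alter is exactly where the heuristics surveyed earlier differ.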

Definition 1 Support

Given an itemset X, the support set of X in database D, denoted

Tree and graph patterns

In this section, after providing the graph and tree preliminaries, we define relevant classes of tree and graph patterns that will be used in the respective knowledge hiding problems. Tree (resp. graph) patterns are patterns that can be expressed as trees (resp. graphs). The keyword relevant emphasizes that there exist respective classes of known frequent substructure mining tools and applications.

Tree hiding

In this section, we define the tree hiding problem and present our solution approaches for each of the four classes of subtree containment relationships introduced in Section 3.2. Before introducing the tree hiding problem and its solution, we first define an essential subproblem, the subtree matching problem, and a straightforward algorithm solving it. We show that the straightforward algorithm is inefficient, and hence we later develop efficient subtree matching algorithms to replace it.
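To give a feel for the subtree matching subproblem, here is a minimal Python sketch for one of the four classes, induced-ordered containment. Since the formal definitions of Section 3.2 are not part of this excerpt, the sketch assumes the common definition: the pattern must map onto data nodes so that labels, parent-child edges and the left-to-right order of siblings are preserved. The tree encoding as `(label, children)` tuples and the function names are hypothetical, and this is not the paper's algorithm.

```python
def matches_at(pattern, node):
    """True if `pattern` maps onto the data tree with its root on `node`,
    preserving labels, parent-child edges and sibling order."""
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    j = 0
    for pc in p_children:
        # Greedily take the leftmost remaining data child that matches pc;
        # this is safe because the subtrees of distinct children are disjoint.
        while j < len(n_children) and not matches_at(pc, n_children[j]):
            j += 1
        if j == len(n_children):
            return False
        j += 1
    return True

def contains(pattern, tree):
    """True if `pattern` occurs at the root of `tree` or at any descendant."""
    return matches_at(pattern, tree) or any(contains(pattern, c) for c in tree[1])

# Data tree: a(b(d), c, b(e))
data = ("a", [("b", [("d", [])]), ("c", []), ("b", [("e", [])])])

p1 = ("a", [("b", [("e", [])])])           # matches via the second b child
p2 = ("a", [("c", []), ("b", [])])         # c followed by the last b
p3 = ("a", [("c", []), ("d", [])])         # d is a grandchild, not a child
p4 = ("a", [("b", [("e", [])]), ("c", [])])  # wrong sibling order
```

Here `contains` returns True for `p1` and `p2` and False for `p3` and `p4`. Relaxing the parent-child requirement to ancestor-descendant would give the embedded variant, and dropping the sibling-order requirement the unordered one; both need a different matching procedure than this greedy scan.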

The hiding problem

Graph hiding

If a pattern graph H is an embedded subgraph of data graph G, i.e. H  G, then we also say that G includes (or supports) H. A data graph may include a pattern graph in several ways, since different isomorphisms may yield different inclusions. The weighted support for graphs can be defined similarly to that for trees in Definition 11. For example, consider the data graph given in Fig. 4(a) and the pattern graph given in Fig. 4(c). From the figure we see that G supports H

Experimental evaluation

In this section, we present our experimental setting and the performance results of the tree hiding and graph hiding algorithms, which are implemented in Java. All the experiments are performed on a test computer equipped with a 3.0 GHz quad-core Intel CPU and 2 GB RAM, running the Windows 7 operating system.

For both the tree and graph hiding experiments, we measure three performance metrics: the runtime efficiency, the data distortion (number of masking symbols introduced) M0, and the frequent patterns distortion M1, on

Conclusion

Knowledge hiding is an important issue when disclosing databases that may potentially contain sensitive knowledge. After the sensitive knowledge patterns have been identified, a sanitization process is applied before the data publication. Hence, the data receiver cannot resurface the sensitive knowledge through data mining techniques. Even though the problem has been studied for itemsets and sequential patterns, our current work extends knowledge hiding to tree and graph structured

Osman Abul received the PhD degree in computer science from the Middle East Technical University, Ankara, Turkey, in 2005. He is currently an assistant professor of computer science at TOBB University of Economics and Technology, Ankara, Turkey. His research interests include (privacy preserving) data mining and bioinformatics.

References (48)

  • E. Dasseni et al., Hiding association rules by using confidence and support
  • O. Abul, M. Atzori, F. Bonchi, F. Giannotti, Hiding sensitive trajectory patterns, in: 6th Int. Workshop on Privacy...
  • O. Abul, M. Atzori, F. Bonchi, F. Giannotti, Hiding sequences, in: Third ICDE Int. Workshop on Privacy Data Management...
  • V.S. Verykios et al., Association rule hiding, IEEE Transactions on Knowledge and Data Engineering (2004)
  • Y. Saygin et al., Using unknowns to prevent discovery of association rules, ACM SIGMOD Record (2001)
  • Y. Saygin et al., Privacy preserving association rule mining
  • S.R.M. Oliveira et al., Protecting sensitive knowledge by data sanitization
  • E.D. Pontikakis et al., An experimental study of distortion-based techniques for association rule hiding
  • G. Lee, C.-Y. Chang, A. L. P. Chen, Hiding sensitive patterns in association rules mining, in:...
  • X. Sun et al., A border-based approach for hiding sensitive frequent itemsets
  • S. Menon et al., Maximizing accuracy of shared databases when concealing sensitive patterns, Information Systems Research (2005)
  • M.J. Zaki, Efficiently mining frequent trees in a forest: algorithms and applications, IEEE Transactions on Knowledge and Data Engineering (2005)
  • S. Nijssen et al., A quickstart in frequent structure mining can make a difference
  • X. Yan et al., gSpan: graph-based substructure pattern mining

    Harun Gökçe received the MSc degree in computer science from the TOBB University of Economics and Technology, Ankara, Turkey, in 2010. He is currently working in industry as an information technology expert. His research interest is in privacy preserving data mining.

    Supported by TUBITAK, project number 108E016.
