Bottom-up discovery of frequent rooted unordered subtrees

https://doi.org/10.1016/j.ins.2008.08.020

Abstract

In the past decade, XML has emerged as the standard language for information exchange over the Internet. Owing to its tree-structured paradigm, XML is well suited to storing, querying, and manipulating complex data. Discovering frequent tree patterns in tree-structured data has therefore become an interesting topic for XML data management. In this paper, we propose a tree mining algorithm, named BUXMiner, for finding a special class of frequent trees, called rooted unordered trees, in a tree-structured database. BUXMiner employs an efficient bottom-up approach to enumerate all candidate trees over a compact global tree guide and computes the frequent trees based on the tree guide. In addition to BUXMiner, we also propose a mining approach called BUMXMiner to discover the maximal frequent rooted unordered trees. We compare BUXMiner with previous tree-structure mining algorithms, namely XQPMinerTID and FastXMiner, which were also proposed to discover rooted unordered trees. The experimental results show that our algorithm outperforms XQPMinerTID and FastXMiner in terms of efficiency. The performance results from real-world applications also indicate the usefulness of our proposed tree mining algorithms in a variety of web applications, such as analysis of web page access patterns and mining frequent XML query patterns for caching.

Introduction

Since being introduced in 1993 by Agrawal et al. [1], the problem of frequent pattern mining has received a great deal of attention. Finding frequent patterns in databases is the fundamental operation behind several common data mining tasks, including association rule mining [2], [20] and sequential pattern mining [3], [9]. In recent years, semi-structured data have become ubiquitous with the rapid growth in both the number and the scale of semi-structured data applications such as XML database systems, business transactions, XML middleware systems, and others. Hence, there has been increasing demand for extracting useful frequent patterns from large semi-structured databases [7], [8], [12], [15], [16], [19], [23], [26], [30].

The frequent subtree is one of the most important patterns in semi-structured databases. Mining frequent subtrees has many practical applications in areas such as web mining, bioinformatics, XML document mining, and XML query pattern mining. There has already been work in the literature dedicated to mining several different types of tree structures, including ordered trees [5], [27], unordered trees [6], [28], rooted trees [24], [25], and free trees [13], [14]. While each type of tree has its own areas of application, we consider only one specific type in this paper. We examine the problem of mining frequent rooted unordered subtrees (FRUST) from a semi-structured database. Mining this kind of subtree is particularly useful for applications such as the analysis of user access patterns in web sites or XML query performance optimization. Because a semi-structured database is always analogous to a tree database when considering subtree mining, we use the terms semi-structured database and tree database interchangeably in this paper.

Generally, a subtree mining method consists of two consecutive phases: the first is a candidate tree generation phase, which generates non-redundant candidate subtrees; the second is a support computation phase, which computes the support (or number of occurrences) of each candidate subtree in the database. A brute-force approach incurs exponential complexity: as the dataset grows in both complexity and size, the number of candidates to be considered usually increases exponentially. To improve the performance of the mining algorithm, we look at novel techniques to reduce the cost of both phases.
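As a minimal sketch of the support computation phase under a brute-force approach, the Python below counts the support of each candidate by scanning the whole database. It rests on a simplifying assumption that is not taken from the paper: sibling labels are unique, so a rooted unordered tree can be encoded as the set of its root-to-node label paths, and rooted-subtree inclusion becomes a subset test. The helper names `tree` and `support_by_scan` and the toy database are hypothetical.

```python
# Illustrative encoding (an assumption, not the paper's): a rooted unordered
# tree with unique sibling labels is the set of its root-to-node label paths.
def tree(*paths):
    """Build a tree from path strings such as "a/c/d", adding all prefixes."""
    nodes = set()
    for p in paths:
        labels = tuple(p.split("/"))
        for i in range(1, len(labels) + 1):
            nodes.add(labels[:i])
    return frozenset(nodes)


def support_by_scan(pattern, database):
    """Count the trees containing `pattern` as a rooted subtree (one full scan)."""
    return sum(1 for t in database if pattern <= t)


# Toy database of three trees that all share the root label "a".
database = [tree("a/b", "a/c/d"), tree("a/b", "a/c"), tree("a/c/d")]

# Phase 1 would generate these candidates; here they are simply listed.
candidates = [tree("a/b"), tree("a/c/d"), tree("a/b", "a/c/d")]

# Phase 2: one pass over the database per candidate.
supports = {c: support_by_scan(c, database) for c in candidates}
```

Every candidate costs one pass over the database in this sketch, so the total cost grows with the number of candidates that the first phase lets through.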

A conventional approach to the first phase is to employ rightmost branch expansion enumeration in a top-down fashion. Unfortunately, this approach is likely to produce many infrequent candidate subtrees, incurring unacceptable database scanning costs during the second phase. Therefore, one must develop effective schemes to prune infrequent candidates, so as to minimize the size of the output from candidate generation. In the second phase, in turn, one needs an efficient method to count the number of occurrences of each candidate in the database.

We propose an efficient scheme called BUXMiner to address the two problems mentioned above. Our solution is briefly described as follows: we first generate a data structure called the compressed global tree guide. Then, we perform a bottom-up candidate generation process based on the tree guide. One distinguishing feature of BUXMiner is that we can obtain the support (number of occurrences) of candidates directly from the tree guide. Therefore, we are able to avoid the expensive database scans in the second phase.
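The idea of reading supports off a merged structure, rather than rescanning the database, can be sketched as follows. This is a simplified stand-in and not the paper's compressed global tree guide: it assumes unique sibling labels (so a tree is the set of its root-to-node label paths) and stores, for each node of the merged tree, the ids of the trees that contain it. Under that assumption the support of a pattern is the size of the intersection of the id sets of its nodes. All names (`build_guide`, `prune`, `support_from_guide`) are hypothetical.

```python
from collections import defaultdict


def tree(*paths):
    """Build a tree (set of root-to-node label paths) from strings like "a/c/d"."""
    nodes = set()
    for p in paths:
        labels = tuple(p.split("/"))
        for i in range(1, len(labels) + 1):
            nodes.add(labels[:i])
    return frozenset(nodes)


def build_guide(database):
    """Merge all trees into one guide: node path -> ids of trees containing it."""
    guide = defaultdict(set)
    for tid, t in enumerate(database):
        for node in t:
            guide[node].add(tid)
    return dict(guide)


def prune(guide, min_support):
    """Drop nodes that occur in fewer than `min_support` trees."""
    return {node: tids for node, tids in guide.items() if len(tids) >= min_support}


def support_from_guide(guide, pattern):
    """Support of `pattern`, computed from the guide alone (no database scan)."""
    tid_sets = [guide.get(node, set()) for node in pattern]
    return len(set.intersection(*tid_sets)) if tid_sets else 0


database = [tree("a/b", "a/c/d"), tree("a/b", "a/c", "a/e"), tree("a/c/d")]
guide = prune(build_guide(database), min_support=2)

print(support_from_guide(guide, tree("a/c/d")))         # pattern a -> c -> d
print(support_from_guide(guide, tree("a/b", "a/c/d")))  # pattern with both branches
```

Because the id set of a node is always contained in the id set of its parent, pruning infrequent nodes in this sketch never leaves a node without its ancestors.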

In addition to frequent tree mining, we also consider the problem of mining the maximal frequent rooted unordered subtrees (mFRUST). Each frequent rooted subtree can be regarded as a subtree of at least one member of the mFRUST set. One benefit of mining the mFRUST set is that it usually has much smaller cardinality than the FRUST set, especially when the trees in the database are strongly correlated. The mFRUST results may therefore be better suited to human interpretation and analysis. In our work, we propose a scheme called BUMXMiner for mining the mFRUST.
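The relationship between the two result sets can be illustrated with a naive post-filter that keeps only those frequent patterns not properly contained in another frequent pattern. This is not BUMXMiner, only a sketch of what "maximal" means; it assumes patterns are encoded as sets of root-to-node label paths (unique sibling labels), and the names `tree` and `maximal_patterns` as well as the example patterns are hypothetical.

```python
def tree(*paths):
    """Build a pattern (set of root-to-node label paths) from strings like "a/c/d"."""
    nodes = set()
    for p in paths:
        labels = tuple(p.split("/"))
        for i in range(1, len(labels) + 1):
            nodes.add(labels[:i])
    return frozenset(nodes)


def maximal_patterns(frequent):
    """Keep the patterns that are not a proper subtree of another frequent pattern."""
    frequent = list(frequent)
    return [p for p in frequent if not any(p < q for q in frequent)]


# A hypothetical set of frequent rooted subtrees (all share the root "a").
frequent = [
    tree("a"),
    tree("a/b"),
    tree("a/c"),
    tree("a/b", "a/c"),
    tree("a/c/d"),
]
maximal = maximal_patterns(frequent)

# Every frequent pattern is covered by at least one maximal pattern.
covered = all(any(p <= m for m in maximal) for p in frequent)
```

In this toy example five frequent patterns collapse to two maximal ones, `a(b, c)` and `a(c(d))`. The quadratic pairwise comparison is acceptable only for small result sets.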

The main contributions of our work are summarized as follows:

  1. We propose to transform the problem of mining the FRUST on the original tree dataset into the problem of mining the FRUST on a compressed global tree guide. This tree guide is created by first merging all subtrees into a global tree guide and then pruning the latter.

  2. Based on the compressed global tree guide, we present algorithms to generate the (maximal) FRUST set using a bottom-up approach.

  3. We implement our mining algorithms and compare them against previous methods in the literature. We conduct extensive experiments to demonstrate the performance improvements of our algorithms over the previous ones.
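One possible bottom-up enumeration over a pruned tree guide is sketched below. It is not the paper's algorithm, whose details are not given in this section: it assumes unique sibling labels, represents the guide as a mapping from each node (a root-to-node label path) to the ids of the trees containing it, and takes "rooted subtree" to mean a subtree that contains the root. The patterns topped by a node are assembled only after the patterns of its children are known, and a combination is kept only if its tree-id intersection still meets the support threshold. The function name `enumerate_frequent` and the toy guide are hypothetical.

```python
from collections import defaultdict


def enumerate_frequent(guide, root, min_support):
    """Return {pattern: support} for the frequent patterns that contain `root`.

    `guide` maps node path -> set of tree ids and is assumed to be pruned
    already, so every node in it is frequent on its own.
    """
    children = defaultdict(list)
    for node in guide:
        if len(node) > 1:
            children[node[:-1]].append(node)

    def rooted_at(node):
        # Patterns whose topmost node is `node`; the recursion finishes the
        # children first, so patterns grow from the leaves upward.
        results = [(frozenset([node]), guide[node])]
        for child in children.get(node, []):
            extended = []
            for sub_nodes, sub_tids in rooted_at(child):
                for nodes, tids in results:
                    common = tids & sub_tids
                    if len(common) >= min_support:
                        extended.append((nodes | sub_nodes, common))
            results.extend(extended)
        return results

    return {nodes: len(tids) for nodes, tids in rooted_at(root)}


# Toy pruned guide for three trees (ids 0, 1, 2); values are tree-id sets.
guide = {
    ("a",): {0, 1, 2},
    ("a", "b"): {0, 1},
    ("a", "c"): {0, 1, 2},
    ("a", "c", "d"): {0, 2},
}
frequent = enumerate_frequent(guide, ("a",), min_support=2)
```

With `min_support=2` the sketch yields five patterns; the four-node pattern containing both `b` and `d` is rejected because only tree 0 contains it. Supports are derived entirely from the guide, so the database is never rescanned during enumeration.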

In the preliminary version of this paper [7], we introduced BUXMiner, which mines frequent query patterns from XML queries. This extended version contains several major extensions and modifications. First, the main data structure used for generating the candidate subtrees differs from the one used in the preliminary version. The new data structure consumes less memory and can achieve a significant reduction in storage space for dense databases. Second, we introduce an algorithm called BUMXMiner to generate the maximal frequent rooted unordered subtrees (mFRUST) for a tree database. Mining the mFRUST produces smaller result sets and is preferable in some applications. We also give an example of such an application, namely caching frequent XML query patterns, and illustrate that the query performance improvement obtained with the mFRUST is larger than that obtained with the FRUST. Third, we study the performance of two real-world applications of our proposed tree mining algorithms: one is the analysis of web page access patterns, and the other is mining frequent XML query patterns for caching. Finally, we add more detailed descriptions of the proposed techniques, which we omitted in [7] owing to space limitations.

The rest of the paper is organized as follows. In Section 2 we discuss previous works related to tree mining. Section 3 gives the definition of the problem of mining rooted unordered subtrees. We propose the mining algorithm of FRUST in Section 4 and that of mFRUST in Section 5. Section 6 presents our experimental results. Section 7 shows the performance results of the real-world applications of our mining approach. We conclude our work in Section 8.

Section snippets

Related work

In this section, we first look at the mining algorithms for generic frequent subtrees in tree databases (or the XML form of the same problem). We then review a special class of subtree mining algorithms, namely those for frequent rooted subtrees.

Problem statement

Before looking at the mining schemes, we first state the problems to be solved. We begin by defining the FRUST set. We then describe how our FRUST mining problem can be transformed from the original tree dataset to a compressed global tree guide.

Frequent subtree pattern mining

In this section, we describe the BUXMiner algorithm that mines the FRUSTs from a database of unordered trees. Algorithm 1 shows the procedure of the BUXMiner algorithm. There are two main steps for generating frequent subtrees from database D. First, we build the compressed global tree guide by merging the trees and pruning infrequent nodes (Line 1 in Algorithm BUXMiner). Second, we enumerate the frequent rooted subtrees from the compressed global tree guide (Line 2). As the first step has been

Maximal frequent subtree pattern mining

In this section, we describe the BUMXMiner that mines maximal frequent rooted subtrees (mFRUST) from a database of unordered trees. In Algorithm 2 we show the pseudo code of BUMXMiner. Like the BUXMiner algorithm, there are two main steps for finding the maximal frequent subtrees in the database D. First, a compressed global tree guide is constructed from the database. Second, the maximal frequent rooted subtrees are enumerated from bottom to top over the compressed global tree guide. In the

Experiments

In this section we evaluate the effectiveness, efficiency, and scalability of our mining algorithms for both frequent rooted unordered trees and maximal frequent rooted unordered trees. We compare our BUXMiner tree mining algorithm against previous algorithms XQPMinerTID and FastXMiner. We also demonstrate the performance improvement of our maximal frequent tree mining algorithm compared to a naive approach that uses a generate-and-test approach. All the mining algorithms are implemented in the

Results of sample applications

Tree mining is very useful in domains like web mining, XML data management and others. In this section, we present the results of two applications which employ our proposed tree mining algorithms.

Conclusion

In this paper, we proposed BUXMiner, an efficient tree mining method to discover frequent rooted unordered subtrees (FRUST). Meanwhile, we also presented a mining algorithm BUMXMiner based on BUXMiner to mine maximal frequent rooted unordered subtrees (mFRUST). We mine frequent query patterns over a tree schema called compressed global tree guide (CGTG), which can be used to prune infrequent nodes and to merge the compressible nodes. We discover frequent patterns from bottom to top and generate

Acknowledgement

This research was supported by Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT IRT0652), and funded in part by the National Science Foundation of China, in Grant NSFC No. 60603044 and the National High Technology Research and Development Program of China (2006AA010107).

References (31)

  • T. Asai, H. Arimura, T. Uno, S. Nakano, Discovering frequent substructures in large unordered trees, in: Proceedings of...
  • Y.J. Bei, G. Chen, J.X. Dong, BUXMiner: an efficient bottom-up approach to mining XML query patterns, in: Proceedings...
  • D. Braga et al., Discovering interesting information in XML data with association rules

  • L. Chen, E.A. Rundensteiner, S. Wang, Xcache – a semantic caching system for XML queries, in: ACM SIGMOD, 2002, p....
  • L. Chen, S.S. Bhowmick, L.T. Chia, Mining positive and negative association rules from XML query patterns for caching,...

The preliminary version of this paper has been published in APWEB/WAIM 2007.