Abstract
Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, and computer networks. In this paper, we first present two canonical forms for labelled rooted unordered trees: the breadth-first canonical form (BFCF) and the depth-first canonical form (DFCF). We then apply these canonical forms to the frequent subtree mining problem. Based on the BFCF, we develop a vertical mining algorithm, RootedTreeMiner, to discover all frequently occurring subtrees in a database of labelled rooted unordered trees. RootedTreeMiner uses an enumeration tree to enumerate all (frequent) labelled rooted unordered subtrees. Next, we extend the definition of the DFCF to labelled free trees and present an Apriori-like algorithm, FreeTreeMiner, to discover all frequently occurring subtrees in a database of labelled free trees. Finally, we study the performance and scalability of our algorithms through extensive experiments on both synthetic data and datasets from real applications.
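To illustrate why a canonical form matters for unordered trees, the following minimal sketch computes one possible canonical string by recursively sorting the encodings of each node's children. It is not the BFCF or DFCF defined in the paper; the `(label, children)` tuple representation, the parenthesised encoding, and the function names `canonical_string` and `same_unordered_tree` are all assumptions made for illustration, and labels are assumed to contain no parentheses.

```python
from typing import List, Tuple

# Hypothetical representation: a node is (label, list of child nodes).
Tree = Tuple[str, List["Tree"]]


def canonical_string(node: Tree) -> str:
    """Return an order-independent encoding of a labelled rooted tree.

    Children are encoded recursively and their encodings are sorted, so any
    two orderings of the same unordered tree yield the same string.
    Assumes labels contain no '(' or ')' characters.
    """
    label, children = node
    encoded_children = sorted(canonical_string(child) for child in children)
    return label + "(" + "".join(encoded_children) + ")"


def same_unordered_tree(a: Tree, b: Tree) -> bool:
    """Two rooted unordered trees are isomorphic iff their encodings match."""
    return canonical_string(a) == canonical_string(b)


# The same unordered tree written with two different child orders.
t1: Tree = ("A", [("B", []), ("C", [("D", [])])])
t2: Tree = ("A", [("C", [("D", [])]), ("B", [])])
# A different tree: D hangs under B instead of C.
t3: Tree = ("A", [("B", [("D", [])]), ("C", [])])

print(canonical_string(t1))
print(same_unordered_tree(t1, t2), same_unordered_tree(t1, t3))
```

With a unique representative per unordered tree, a mining algorithm can count each candidate subtree once regardless of how its children happen to be ordered in the data.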
Cite this article
Chi, Y., Yang, Y. & Muntz, R. Canonical forms for labelled trees and their applications in frequent subtree mining. Knowl Inf Syst 8, 203–234 (2005). https://doi.org/10.1007/s10115-004-0180-7
DOI: https://doi.org/10.1007/s10115-004-0180-7