Canonical forms for labelled trees and their applications in frequent subtree mining

Chi, Yun; Yang, Yirong; Muntz, Richard R.

doi:10.1007/s10115-004-0180-7

Published: 01 August 2005

Knowledge and Information Systems, volume 8, pages 203–234 (2005)

Abstract

Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, and computer networks. In this paper, we first present two canonical forms for labelled rooted unordered trees: the breadth-first canonical form (BFCF) and the depth-first canonical form (DFCF). We then apply these canonical forms to the frequent subtree mining problem. Based on the BFCF, we develop a vertical mining algorithm, RootedTreeMiner, to discover all frequently occurring subtrees in a database of labelled rooted unordered trees. RootedTreeMiner uses an enumeration tree to enumerate all (frequent) labelled rooted unordered subtrees. Next, we extend the definition of the DFCF to labelled free trees and present an Apriori-like algorithm, FreeTreeMiner, to discover all frequently occurring subtrees in a database of labelled free trees. Finally, we study the performance and scalability of our algorithms through extensive experiments on both synthetic data and datasets from real applications.
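To illustrate why a canonical form matters for unordered trees, the following is a minimal sketch of one generic canonical string encoding: each node's children are encoded recursively and then sorted, so that two trees differing only in sibling order map to the same string. This is not necessarily the BFCF or DFCF as defined in the paper; the `Tree` representation, the `canonical_string` function, and the parenthesised string format are all assumptions made for illustration, and labels are assumed not to contain parentheses or commas.

```python
from typing import List, Tuple

# Hypothetical representation: a tree is (label, list of child trees).
Tree = Tuple[str, List["Tree"]]


def canonical_string(tree: Tree) -> str:
    """Return an order-independent string encoding of a labelled rooted tree.

    Children are encoded first, then sorted lexicographically, so any
    permutation of siblings yields the same result.
    """
    label, children = tree
    encoded_children = sorted(canonical_string(child) for child in children)
    return label + "(" + ",".join(encoded_children) + ")"


# Two trees that differ only in the order of the root's children.
t1: Tree = ("A", [("B", []), ("C", [("D", [])])])
t2: Tree = ("A", [("C", [("D", [])]), ("B", [])])

# A structurally different tree: D hangs under B instead of C.
t3: Tree = ("A", [("B", [("D", [])]), ("C", [])])

print(canonical_string(t1) == canonical_string(t2))  # same unordered tree
print(canonical_string(t1) == canonical_string(t3))  # different tree
```

With an encoding of this kind, testing whether two candidate subtrees are the same unordered tree reduces to a string comparison, which is the property that a mining algorithm needs in order to avoid counting one pattern several times.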

Author information

Authors and Affiliations

Department of Computer Science, University of California, Los Angeles, CA, 90095, USA
Yun Chi, Yirong Yang & Richard R. Muntz

Corresponding author

Correspondence to Yun Chi.

About this article

Cite this article

Chi, Y., Yang, Y. & Muntz, R. Canonical forms for labelled trees and their applications in frequent subtree mining. Knowl Inf Syst 8, 203–234 (2005). https://doi.org/10.1007/s10115-004-0180-7

Received: 20 September 2003
Revised: 09 April 2004
Accepted: 08 May 2004
Published: 01 August 2005
Issue Date: August 2005
DOI: https://doi.org/10.1007/s10115-004-0180-7
