Canonical forms for labelled trees and their applications in frequent subtree mining

Knowledge and Information Systems

Abstract

Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, and computer networks. In this paper, we first present two canonical forms for labelled rooted unordered trees: the breadth-first canonical form (BFCF) and the depth-first canonical form (DFCF). We then apply these canonical forms to the frequent subtree mining problem. Based on the BFCF, we develop a vertical mining algorithm, RootedTreeMiner, to discover all frequently occurring subtrees in a database of labelled rooted unordered trees. RootedTreeMiner uses an enumeration tree to enumerate all (frequent) labelled rooted unordered subtrees. Next, we extend the definition of the DFCF to labelled free trees and present an Apriori-like algorithm, FreeTreeMiner, to discover all frequently occurring subtrees in a database of labelled free trees. Finally, we study the performance and scalability of our algorithms through extensive experiments on both synthetic data and datasets from real applications.
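To illustrate why a canonical form is useful for unordered trees, the following minimal sketch computes an order-independent string encoding of a labelled rooted tree by sorting the encodings of each node's children. It is a generic textbook-style construction and not the BFCF or DFCF defined in the paper, whose exact definitions are not given in this abstract. The `(label, children)` tuple representation, the function names, and the assumption that labels contain no parentheses are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed representation (not from the paper): a tree is (label, list of child trees).
Tree = Tuple[str, List["Tree"]]


def canonical_string(tree: Tree) -> str:
    """Encode a labelled rooted unordered tree so that sibling order does not matter.

    Assumes labels never contain '(' or ')', so the encoding can be parsed unambiguously.
    """
    label, children = tree
    # Encode every child subtree first, then sort the encodings: two trees that
    # differ only in the order of siblings end up with the same string.
    encoded_children = sorted(canonical_string(child) for child in children)
    return label + "(" + "".join(encoded_children) + ")"


def is_isomorphic(a: Tree, b: Tree) -> bool:
    """Two rooted unordered trees are isomorphic iff their encodings match."""
    return canonical_string(a) == canonical_string(b)


# t1 and t2 differ only in the order of the root's children; t3 is a different tree.
t1: Tree = ("A", [("B", [("D", [])]), ("C", [])])
t2: Tree = ("A", [("C", []), ("B", [("D", [])])])
t3: Tree = ("A", [("B", []), ("C", [("D", [])])])

print(canonical_string(t1))
print(is_isomorphic(t1, t2), is_isomorphic(t1, t3))
```

In a mining setting, an encoding of this kind lets candidate subtrees be compared and de-duplicated by simple string equality, which is the role a canonical form plays in enumerating each unordered subtree only once.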

Author information

Correspondence to Yun Chi.

About this article

Cite this article

Chi, Y., Yang, Y. & Muntz, R. Canonical forms for labelled trees and their applications in frequent subtree mining. Knowl Inf Syst 8, 203–234 (2005). https://doi.org/10.1007/s10115-004-0180-7
