Abstract
A key task in mining tree-structured data is finding frequent embedded tree patterns. The task has two settings: transactional and per-occurrence. In the transactional setting, which is the focus of this paper, the crucial step is to decide whether a tree pattern is subtree homeomorphic to a database tree. Our extensive study of real-world tree-structured datasets reveals that while many vertices in a database tree may share the same label, no two vertices on the same path are identically labeled. We exploit this property to propose a novel and efficient method for deciding whether a tree pattern is subtree homeomorphic to a database tree. The method is based on a compact data structure called EMET, which stores all the information required for subtree homeomorphism, and we give an efficient algorithm that generates the EMETs of larger patterns from the EMETs of smaller ones. Building on this method, we introduce TTM, an effective algorithm for finding frequent tree patterns in rooted ordered trees. We evaluate the efficiency of TTM on several real-world and synthetic datasets and show that it outperforms well-known existing algorithms by an order of magnitude.
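To make the transactional setting concrete, the following is a minimal sketch of a naive backtracking test for whether a rooted ordered pattern is embedded in a database tree, together with a per-tree support count. It is not the EMET structure or the TTM algorithm described above. It assumes the usual ordered embedding semantics (labels preserved, a pattern child maps to a proper descendant, and sibling subtrees map to disjoint subtrees in left-to-right order). The names `build_tree`, `is_embedded`, and `support` are hypothetical and chosen only for illustration.

```python
def build_tree(nested):
    """Flatten a nested (label, [children]) tuple into preorder arrays.

    Returns (labels, parent, end), where end[i] is the preorder index of
    the last descendant of vertex i (its "scope").
    """
    labels, parent, end = [], [], []

    def visit(node, par):
        label, children = node
        idx = len(labels)
        labels.append(label)
        parent.append(par)
        end.append(idx)
        for child in children:
            visit(child, idx)
        end[idx] = len(labels) - 1

    visit(nested, -1)
    return labels, parent, end


def is_embedded(pattern, tree):
    """Naive check (illustrative assumption, not the paper's method)."""
    p_labels, p_parent, _ = pattern
    t_labels, _, t_end = tree

    # Previous sibling of each pattern vertex, or -1 if it is a first child.
    prev_sibling = [-1] * len(p_labels)
    last_child = {}
    for v in range(1, len(p_labels)):
        prev_sibling[v] = last_child.get(p_parent[v], -1)
        last_child[p_parent[v]] = v

    image = [-1] * len(p_labels)  # image[v] = tree vertex matched to v

    def extend(v):
        if v == len(p_labels):
            return True
        if v == 0:
            lo, hi = 0, len(t_labels) - 1  # the pattern root may map anywhere
        else:
            par_img = image[p_parent[v]]
            lo, hi = par_img + 1, t_end[par_img]  # proper descendants only
            if prev_sibling[v] != -1:
                # Must lie strictly after the subtree of the previous sibling.
                lo = t_end[image[prev_sibling[v]]] + 1
        for t in range(lo, hi + 1):
            if t_labels[t] == p_labels[v]:
                image[v] = t
                if extend(v + 1):
                    return True
        return False

    return extend(0)


def support(pattern, database):
    """Transactional support: number of database trees containing the pattern."""
    return sum(1 for tree in database if is_embedded(pattern, tree))


# Small usage example with made-up trees.
t1 = build_tree(("a", [("b", [("c", []), ("d", [])]), ("e", [])]))
t2 = build_tree(("a", [("c", []), ("e", [])]))
database = [t1, t2]
p_ok = build_tree(("a", [("c", []), ("e", [])]))
p_swapped = build_tree(("a", [("e", []), ("c", [])]))
print(support(p_ok, database), support(p_swapped, database))
```

In the transactional setting each database tree contributes at most one to the support, no matter how many distinct embeddings it contains, which is why the sketch stops at the first successful match. The backtracking search is exponential in the worst case and serves only to illustrate the decision problem.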

















Data availability
The CSLOGS datasets, along with the tree generator program used to produce the synthetic datasets, are publicly available online. The NASA and Prions datasets were provided through email communications (please see the Acknowledgements section).
Code availability
The release of the code is limited by licensing constraints.
Notes
EMET is an abbreviation for EMbedding Encoder for Transactional tree mining.
TTM is an abbreviation for Transactional Tree Miner.
Acknowledgements
We are thankful to Prof. Mohammed Javeed Zaki for providing the TreeMinerD code, the CSLOGS datasets and the TreeGenerator program, to Dr Henry Tan for providing the MB3Miner-T code, to Prof. Jun-Hong Cui for providing the NASA dataset, and to Dr Fedja Hadzic for providing the Prions dataset. Parts of this work were performed while the second author was at Xerox Research Centre Europe, later known as Naver Labs Europe.
Funding
Not applicable.
Author information
Contributions
Mostafa Haghir Chehreghani developed the ideas, implemented the algorithms, ran the experiments, analyzed the results, and wrote the paper.
Morteza Haghir Chehreghani improved the ideas and wrote the paper.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
Not applicable.
Consent for publication
Not applicable.
About this article
Cite this article
Haghir Chehreghani, M., Haghir Chehreghani, M. Mining transactional tree databases under homeomorphism. J Supercomput 81, 530 (2025). https://doi.org/10.1007/s11227-025-06997-2
DOI: https://doi.org/10.1007/s11227-025-06997-2