Mining transactional tree databases under homeomorphism

The Journal of Supercomputing

Abstract

A key task in mining tree-structured data is finding frequent embedded tree patterns, which has two settings: the transactional setting and the per-occurrence setting. In the transactional setting, which is the focus of this paper, the crucial step is to decide whether a tree pattern is subtree homeomorphic to a database tree. Our extensive study of real-world tree-structured datasets reveals that, while many vertices in a database tree may have the same label, no two vertices on the same path are identically labeled. In this paper, we exploit this property and propose a novel and efficient method for deciding whether a tree pattern is subtree homeomorphic to a database tree. Our algorithm is based on a compact data structure called EMET, which stores all the information required for subtree homeomorphism. We propose an efficient algorithm that generates the EMETs of larger patterns from the EMETs of smaller ones. Based on the proposed subtree homeomorphism method, we introduce TTM, an effective algorithm for finding frequent tree patterns from rooted ordered trees. We evaluate the efficiency of TTM on several real-world and synthetic datasets and show that it outperforms well-known existing algorithms by an order of magnitude.
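To make the transactional setting concrete, the following is a minimal sketch in Python of a naive embedding test and the resulting support count. It is not EMET or TTM, whose details are not given in this abstract; it is a plain backtracking check. It assumes one common reading of an embedded ordered pattern: labels are preserved, a pattern child maps to a proper descendant of its parent's image, and later siblings map strictly to the right of the subtree of an earlier sibling's image. All names (`flatten`, `is_embedded`, `transactional_support`, `paths_have_distinct_labels`) and the nested-tuple tree encoding are hypothetical choices made for illustration.

```python
# A tree is a pair (label, children), where children is an ordered list of trees.

def flatten(tree):
    """Return the preorder label list and, for each vertex, the preorder
    index of the last vertex in its subtree."""
    labels, last = [], []

    def visit(node):
        label, children = node
        idx = len(labels)
        labels.append(label)
        last.append(idx)
        for child in children:
            visit(child)
        last[idx] = len(labels) - 1  # subtree of idx spans [idx, last[idx]]

    visit(tree)
    return labels, last


def is_embedded(pattern, tree):
    """Naive backtracking test (assumed definition, see the text above)."""
    labels, last = flatten(tree)

    def embed_forest(forest, lo, hi):
        # Try to place the ordered list of pattern subtrees on vertices
        # whose preorder index lies in [lo, hi).
        if not forest:
            return True
        (label, children), rest = forest[0], forest[1:]
        for i in range(lo, hi):
            if (labels[i] == label
                    # children must land among proper descendants of i
                    and embed_forest(children, i + 1, last[i] + 1)
                    # later siblings must land to the right of i's subtree
                    and embed_forest(rest, last[i] + 1, hi)):
                return True
        return False

    return embed_forest([pattern], 0, len(labels))


def transactional_support(pattern, database):
    """Count database trees that contain the pattern at least once."""
    return sum(is_embedded(pattern, t) for t in database)


def paths_have_distinct_labels(tree, seen=frozenset()):
    """Check that no label repeats on any root-to-leaf path."""
    label, children = tree
    if label in seen:
        return False
    return all(paths_have_distinct_labels(c, seen | {label}) for c in children)


# Two small database trees: t1 = a(b(c, d), c), t2 = a(c, b).
t1 = ("a", [("b", [("c", []), ("d", [])]), ("c", [])])
t2 = ("a", [("c", []), ("b", [])])
database = [t1, t2]

p1 = ("a", [("c", [])])             # c below a, possibly at some distance
p2 = ("a", [("b", []), ("c", [])])  # b must appear to the left of c

support_p1 = transactional_support(p1, database)
support_p2 = transactional_support(p2, database)
```

In this toy database, `p1` occurs in both trees, so its transactional support is 2, even though `t1` contains two distinct occurrences of it; a per-occurrence count would distinguish them. `p2` is supported only by `t1`, because in `t2` the vertex labeled `c` precedes the vertex labeled `b`. The backtracking search is exponential in the worst case and is meant only to pin down the decision problem that the paper addresses more efficiently.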

Data availability

The CSLOGS datasets, along with the tree generator program used to produce the synthetic datasets, are publicly available on the internet. The NASA and Prions datasets were provided through email communications (see the Acknowledgements section).

Code availability

The release of the code is limited by licensing constraints.

Notes

  1. EMET is an abbreviation for EMbedding Encoder for Transactional tree mining.

  2. TTM is an abbreviation for Transactional Tree Miner.

Acknowledgements

We are thankful to Prof. Mohammed Javeed Zaki for providing the TreeMinerD code, the CSLOGS datasets and the TreeGenerator program, to Dr Henry Tan for providing the MB3Miner-T code, to Professor Jun-Hong Cui for providing the NASA dataset, and to Dr Fedja Hadzic for providing the Prions dataset. Parts of this work were performed while the second author was at Xerox Research Centre Europe, later known as Naver Labs Europe.

Funding

Not applicable.

Author information

Contributions

Mostafa Haghir Chehreghani developed the ideas, implemented the algorithms, ran the experiments, analyzed the results, and wrote the paper.

Morteza Haghir Chehreghani improved the ideas and wrote the paper.

Corresponding author

Correspondence to Mostafa Haghir Chehreghani.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Haghir Chehreghani, M., Haghir Chehreghani, M. Mining transactional tree databases under homeomorphism. J Supercomput 81, 530 (2025). https://doi.org/10.1007/s11227-025-06997-2

  • DOI: https://doi.org/10.1007/s11227-025-06997-2

Keywords