A hierarchical set-enumeration tree enabling high occupancy item set mining and the use of an adaptive occupancy threshold

Applied Intelligence

Abstract

The highly efficient HEP algorithm is a useful tool for mining High Occupancy (HO) item sets. Occupancy is an important measure that describes the interestingness of frequent item sets. The current study examines the efficiency problems in mining HO item sets and proposes an improved HEP algorithm, named advanced HEP (A–HEP), based on set theory rules that eliminate a large number of redundant iterations. The study also proposes a novel adaptive-and-modified HEP (NAM–HEP) algorithm that uses HO Set-Enumeration (SE) trees to store HO item sets. The study further proposes definitions for adaptive thresholds, such as the support threshold and the occupancy threshold, based on the attributes of the transaction database, for efficient pruning of the HO-SE tree. Two pseudo-code blocks are presented together with a detailed description of the A–HEP and NAM–HEP algorithms and their advantages. Using the A–HEP and NAM–HEP algorithms, HO item sets are mined from the real-world transaction databases mushroom and retail. The results indicate that the proposed A–HEP and NAM–HEP algorithms improve mining performance and runtime.

Acknowledgements

The transaction databases can be downloaded from FIMI http://fimi.uantwerpen.be/data/.

Funding

The research leading to the results published in this paper was supported by the European Union under the REFRESH project – Research Excellence For Region Sustainability and High-tech Industries, ID No. CZ.10.03.01/00/22 003/0000048 of the European Just Transition Fund, and by the Ministry of Education, Youth and Sports of the Czech Republic (MEYS CZ) under the project SGS, ID No. SP 7/2023, conducted by VSB - Technical University of Ostrava.

Author information

Contributions

T.-N. Tran, T. H. Vinh, and T.-C. Truong contributed to conceptualization, methodology, validation, formal analysis, investigation, preparation and writing of the original draft and visualizations. M. Voznak contributed to conceptualization, writing and review, supervision, project administration, and acquisition of funds. All authors have read and agree to the published version of the manuscript.

Corresponding author

Correspondence to Vinh Truong Hoang.

Ethics declarations

Conflicts of Interest

The authors declare no known competing financial interests or personal relationships that may have influenced the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

NAM–HEP framework

Figure 7 illustrates the procedure of the proposed NAM–HEP algorithm. In step (i), the transaction database is loaded into memory. In the next step, the NAM–HEP algorithm exploits the features of the transaction database; this step performs two scans. In the first scan, the NAM–HEP algorithm searches the entire transaction database; in the second scan, it searches new transactions only. The features of the database are plotted in charts. Note that step (iii) is the most important, since errors at this step may lead to unsuitable HO item set mining. The support threshold for pre-pruning the power item set \(\textbf{P}_1\) is determined in step (iv). Items whose support is below the support threshold are pre-pruned in step (v). In step (vi), the occupancy of each item in the power item set is calculated from (5a). Using these occupancy values, the HO threshold is determined in step (vii) from (10a). In step (viii), the item sets whose occupancy is greater than or equal to the HO threshold are selected as HO item sets; they are then stored in the HO-SE tree in step (ix). For the next level \(k>1\), the HO-SE tree is scanned in step (x) to obtain the distinguished HO items in step (xi), which are unioned with the leaf nodes at the highest level of the HO-SE tree to generate the power item set in step (xii). The NAM–HEP framework then repeats from step (vi) until the power item set is empty, the current HO threshold is less than the previous threshold (line 42, Algorithm 2), the set of HO item sets is empty (line 40, Algorithm 2), or level k reaches the maximum level K given by (8b).

Fig. 7 The NAM–HEP framework
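The level-wise loop of steps (vi) to (xii) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' Algorithm 2: equations (5a), (10a) and (8b) are not reproduced in this appendix, so the occupancy function uses the commonly cited definition (the average of \(|X|/|T|\) over the transactions that contain \(X\)), the adaptive threshold is a placeholder (the mean occupancy of the current candidates), and the maximum level is a plain parameter. The names `occupancy`, `mine_levelwise`, `threshold_fn` and `max_level` are hypothetical, the HO-SE tree is replaced by a flat dictionary of levels, and support pre-pruning is omitted.

```python
from fractions import Fraction
from statistics import mean


def occupancy(itemset, transactions):
    """Assumed definition: average of |X| / |T| over transactions T containing X."""
    ratios = [Fraction(len(itemset), len(t)) for t in transactions if itemset <= t]
    return mean(ratios) if ratios else Fraction(0)


def mine_levelwise(transactions, items, threshold_fn=mean, max_level=3):
    """Level-wise HO mining loop; returns {level: set of HO item sets}."""
    transactions = [frozenset(t) for t in transactions]
    power = {frozenset([i]) for i in items}   # level-1 power item set
    levels, prev_threshold = {}, None         # flat stand-in for the HO-SE tree
    for k in range(1, max_level + 1):         # stop at the maximum level K
        # Step (vi): occupancy of every candidate that occurs at all.
        occ = {}
        for x in power:
            o = occupancy(x, transactions)
            if o > 0:
                occ[x] = o
        if not occ:                           # power item set exhausted
            break
        # Step (vii): adaptive threshold (placeholder: mean occupancy).
        threshold = threshold_fn(list(occ.values()))
        if prev_threshold is not None and threshold < prev_threshold:
            break                             # threshold dropped below previous one
        # Steps (viii)-(ix): select and store the HO item sets.
        ho = {x for x, o in occ.items() if o >= threshold}
        if not ho:
            break
        levels[k], prev_threshold = ho, threshold
        # Steps (x)-(xii): union the stored HO items with the top-level leaves.
        ho_items = {i for sets in levels.values() for x in sets for i in x}
        power = {x | {i} for x in ho for i in ho_items if i not in x}
    return levels


toy = [{1, 2}, {1, 2, 3}, {1, 3}, {2, 3, 4}]
levels = mine_levelwise(toy, items=[1, 2, 3, 4])
```

On this toy database the sketch keeps {1}, {2} and {3} at level 1, {1, 2} and {1, 3} at level 2, and {1, 2, 3} at level 3. Exact fractions are used so that comparisons against the threshold are not affected by floating-point rounding.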

Transaction database features

Figure 8 plots the item supports obtained from the analysis of the mushroom and retail transaction databases.

Fig. 8 Item supports in databases (a) mushroom and (b) retail

The results show that database mushroom contains 8124 transactions (i.e., \(\left| {{\textbf{T}}} \right| = 8124\)), each with 23 items (i.e., \(\left| {{{\textbf{T}}_1}} \right| = \ldots = \left| {{{\textbf{T}}_{8124}}} \right| = 23\)). Database mushroom can therefore be loaded into the matrix \(\textbf{T}\) of size \(8124 \times 23\). Using (3a) to investigate item support reveals that some items have low support (e.g., the two items \(\{8\}\) and \(\{12\}\) have \( Support\left( \{8\}\right) = Support\left( \{12\}\right) = 4\)), while others have high support (e.g., item \(\{85\}\) has \( Support\left( \{85\}\right) = 8124\)). This means that item \(\left\{ 85\right\} \) appears in all 8124 transactions. The average support and median support are 1570.2 and 600, respectively. The average support is much greater than the median support because a few items with very high support skew the average. We therefore use the median support given by (13a) as the support threshold instead of the average support. With a support threshold \(\delta = 600\) based on the median support, a large number of items can be pruned from the mushroom database. Applying this condition in line 9 of Algorithm 1 and in line 9 of Algorithm 2 pre-prunes 59 items, leaving 60 items in database mushroom that satisfy the support threshold and are considered for HO item set mining.
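The median-based pre-pruning described above can be illustrated with a short Python sketch. It assumes that support is the number of transactions containing an item and that items with support below the threshold are pruned; expression (13a) itself is not shown in this appendix, so the plain median is an assumption, and the function names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def item_supports(transactions):
    """Count, for each item, the number of transactions that contain it."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # set(): count an item once per transaction
    return counts


def pre_prune(transactions):
    """Return (threshold, kept items, pruned items, average support)."""
    supports = item_supports(transactions)
    delta = median(supports.values())          # median support as threshold
    kept = {i for i, s in supports.items() if s >= delta}
    pruned = set(supports) - kept              # support below the threshold
    return delta, kept, pruned, mean(supports.values())


toy = [[1, 2, 3], [1, 2, 4], [1, 3, 5], [1, 2, 3]]
delta, kept, pruned, avg = pre_prune(toy)
```

In the toy data, item 1 occurs in every transaction and pulls the average support (2.4) away from the bulk of the items, while the median (3) is unaffected by how large that one support is. As a consistency check on the figures reported for mushroom, \(8124 \times 23 = 186852\) item occurrences spread over \(59 + 60 = 119\) items give an average support of about 1570.2.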

The results in Figure 8b indicate that the initial analysis finds 16470 distinguished items in database retail, i.e., \(\textbf{I} = \left\{ \{0\},\ldots ,\{16469\}\right\} \) and \(\left| \textbf{I}\right| = 16470\). Database retail contains 88162 transactions, i.e., \(\left| \textbf{T}\right| = 88162\). Unlike database mushroom, database retail has transactions of different lengths. Specifically, database retail contains 3116 single-item transactions, while the transaction \(\textbf{T}_{70925}\) contains 76 items (i.e., \(\left| \textbf{T}_{70925} \right| = 76\)). This means that some transactions are short and others are long. It is thus difficult to load database retail directly as a matrix \(\textbf{T}\). In this case, we find the longest transaction (i.e., \(\left| \textbf{T}_{70925} \right| = 76\)) and then pad the other transactions with “not a number” (NaN) cells so that they all have the same length \(\left| {{\mathbf{{T}}_1}} \right| = \ldots = \left| {{\mathbf{{T}}_{88162}}} \right| = 76\). Consequently, the transaction database \(\textbf{T}\) has a size of \(88162 \times 76\). MATLAB, however, recognizes all NaN values as distinguished items when the database \(\textbf{T}\) is scanned to collect the distinguished item set \(\textbf{I}\). All NaN values must therefore be trimmed from \(\textbf{I}\) so that it contains only the valid items (i.e., \(\textbf{I} = \left\{ \{0\},\ldots ,\{16469\}\right\} \)) belonging to the database \(\textbf{T}\). Further investigation of the item supports shows 2224 items with a support of only 1, while item \(\left\{ 39\right\} \) has a very high support (i.e., \( Support\left( \left\{ 39\right\} \right) = 50675\)). This means that some items have low support while others have high support. We then obtain the average support and median support values of 55.1655 and 11, respectively. The average support is greater than the median support because a few items with very high support skew the average. Again, we propose using the median support as the support threshold instead of the average support. Using expression (13a), we obtain the support threshold \(\delta = 11\) for database retail. Applying this condition in line 9 of Algorithm 1 and in line 9 of Algorithm 2 pre-prunes 8229 items, leaving 8241 items in database retail that satisfy the support threshold and are considered for HO item set mining.
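The padding and trimming steps described for database retail can be sketched in Python as follows. This is an analogue of the idea rather than the authors' MATLAB code: the function names and the toy transactions are hypothetical, and a nested list stands in for the matrix \(\textbf{T}\).

```python
import math

NAN = float("nan")  # padding marker for empty cells


def pad_transactions(transactions):
    """Pad every transaction with NaN up to the length of the longest one."""
    width = max(len(t) for t in transactions)
    return [list(t) + [NAN] * (width - len(t)) for t in transactions]


def distinct_items(matrix):
    """Collect the distinct valid items, skipping NaN padding cells."""
    items = set()
    for row in matrix:
        for cell in row:
            if isinstance(cell, float) and math.isnan(cell):
                continue  # trim NaN so it is never treated as an item
            items.add(cell)
    return sorted(items)


toy = [[39], [0, 39, 48], [1, 2]]
matrix = pad_transactions(toy)   # every row now has 3 cells
items = distinct_items(matrix)
```

Filtering on `math.isnan` rather than relying on equality is deliberate, since NaN does not compare equal to itself and would otherwise leak into the item set as spurious entries.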

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tran, TN., Truong Hoang, V., Truong, TC. et al. A hierarchical set-enumeration tree enabling high occupancy item set mining and the use of an adaptive occupancy threshold. Appl Intell 55, 205 (2025). https://doi.org/10.1007/s10489-024-06166-7

  • DOI: https://doi.org/10.1007/s10489-024-06166-7
