Abstract
The highly efficient HEP algorithm is a useful tool for mining High Occupancy (HO) item sets. Occupancy is an important measure that describes the interestingness of frequent item sets. The current study examines the efficiency problems in mining HO item sets and proposes an improved HEP algorithm, named advanced HEP (A–HEP), based on set theory rules that eliminate a large number of redundant iterations. The study also proposes a novel adaptive-and-modified HEP (NAM–HEP) algorithm that uses HO Set-Enumeration (SE) trees to store HO item sets. To prune the HO-SE tree efficiently, the study defines adaptive thresholds, namely a support threshold and an occupancy threshold, based on the attributes of the transaction database. Two pseudo-code blocks are presented together with a detailed description of the A–HEP and NAM–HEP algorithms and their advantages. Using the A–HEP and NAM–HEP algorithms, HO item sets are mined from the real-world transaction databases mushroom and retail. The results indicate that the proposed A–HEP and NAM–HEP algorithms improve mining performance and runtime.
Acknowledgements
The transaction databases can be downloaded from FIMI http://fimi.uantwerpen.be/data/.
Funding
The research leading to the results published in this paper was supported by the European Union under the REFRESH project – Research Excellence For Region Sustainability and High-tech Industries, ID No. CZ.10.03.01/00/22 003/0000048 of the European Just Transition Fund, and by the Ministry of Education, Youth and Sports of the Czech Republic (MEYS CZ) under the project SGS, ID No. SP 7/2023, conducted by VSB - Technical University of Ostrava.
Author information
Contributions
T.-N. Tran, T. H. Vinh, and T.-C. Truong contributed to conceptualization, methodology, validation, formal analysis, investigation, visualization, and the preparation and writing of the original draft. M. Voznak contributed to conceptualization, writing and review, supervision, project administration, and funding acquisition. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflicts of Interest
The authors declare no known competing financial interests or personal relationships that may have influenced the work reported in this paper.
Appendices
NAM–HEP framework
Figure 7 illustrates the procedure of the proposed NAM–HEP algorithm. In step (i), the transaction database is loaded into memory. In the next step, the NAM–HEP algorithm extracts the features of the transaction database using two scans: the first scan covers the entire transaction database, and the second covers new transactions only. The features of the database are plotted in charts. Note that step (iii) is the most important since an inaccurate analysis at this step may lead to unsuitable HO item set mining. The support threshold for pre-pruning the power item set \(\textbf{P}_1\) is determined in step (iv). The items whose support is less than the support threshold are pre-pruned in step (v). In step (vi), the occupancy of each item in the power item set is calculated from (5a). Using these occupancy results, the HO threshold is determined in step (vii) from (10a). In step (viii), the item sets whose occupancy is greater than or equal to the HO threshold are retained as HO item sets; they are then stored in the HO-SE tree in step (ix). For the next level \(k>1\), the HO-SE tree is scanned in step (x) to obtain the distinguished HO items in step (xi), which are united with the leaf nodes at the highest level of the HO-SE tree to generate the power item set in step (xii). The NAM–HEP framework repeats from step (vi) until one of the following conditions holds: the power item set is empty, the current HO threshold is less than the previous threshold (line 42, Algorithm 2), the set of HO item sets is empty (line 40, Algorithm 2), or level k reaches the maximum level K given by (8b).
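The level-wise loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 2: the names `next_power_item_set` and `level_wise_mining` are hypothetical, the occupancy measure (5a) and the adaptive HO threshold (10a) are not reproduced here and must be supplied by the caller as the functions `occupancy` and `threshold`, and the order in which the stopping conditions are tested is an assumption.

```python
from typing import Callable, FrozenSet, Iterable, List, Sequence, Set

Itemset = FrozenSet[int]


def next_power_item_set(leaves: Iterable[Itemset]) -> List[Itemset]:
    """Steps (x)-(xii): unite each leaf with every distinct HO item it lacks."""
    leaves = list(leaves)
    # Distinct items that occur in the HO item sets of the current level.
    distinct_items = sorted({item for leaf in leaves for item in leaf})
    candidates: Set[Itemset] = set()
    for leaf in leaves:
        for item in distinct_items:
            if item not in leaf:
                candidates.add(leaf | {item})
    return sorted(candidates, key=sorted)


def level_wise_mining(
    items: Iterable[int],
    occupancy: Callable[[Itemset], float],        # stand-in for (5a), caller-supplied
    threshold: Callable[[Sequence[float]], float],  # stand-in for (10a), caller-supplied
    max_level: int,                                # K, stand-in for (8b)
) -> List[List[Itemset]]:
    """Return the HO item sets kept at each level (index 0 holds level 1)."""
    levels: List[List[Itemset]] = []
    power = [frozenset([i]) for i in sorted(items)]
    previous = None
    for _ in range(max_level):                     # stop when level k reaches K
        if not power:                              # stop: power item set is empty
            break
        scores = {p: occupancy(p) for p in power}  # step (vi)
        current = threshold(list(scores.values()))  # step (vii)
        if previous is not None and current < previous:
            break                                  # stop: threshold decreased
        ho = [p for p in power if scores[p] >= current]  # step (viii)
        if not ho:                                 # stop: no HO item sets
            break
        levels.append(ho)                          # step (ix): one tree level
        previous = current
        power = next_power_item_set(ho)            # steps (x)-(xii)
    return levels
```

Here each entry of the returned list plays the role of one level of the HO-SE tree; a full tree would additionally record the parent of each node.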
Transaction database features
Figure 8 plots the results of the analysis of the mushroom and retail transaction databases, which reveal the features discussed below.
The results show that database mushroom contains 8124 transactions (i.e., \(\left| {{\textbf{T}}} \right| = 8124\)), each with 23 items (i.e., \(\left| {{{\textbf{T}}_1}} \right| = \ldots = \left| {{{\textbf{T}}_{8124}}} \right| = 23\)). Database mushroom can therefore be loaded into a matrix \(\textbf{T}\) of size \(8124 \times 23\). Using (3a) to investigate item support reveals that some items have low support (e.g., items \(\{8\}\) and \(\{12\}\) have \( Support\left( \{8\}\right) = Support\left( \{12\}\right) = 4\)), while others have high support (e.g., item \(\{85\}\) has \( Support\left( \{85\}\right) = 8124\)), meaning that item \(\left\{ 85\right\} \) appears in all 8124 transactions. The average support and median support are 1570.2 and 600, respectively. The average is much greater than the median because a few items with very high support inflate the average. We therefore use the median support given by (13a) as the support threshold instead of the average support. Applying the support threshold \(\delta = 600\) in line 9 of Algorithm 1 and in line 9 of Algorithm 2 pre-prunes 59 items. The remaining 60 items in database mushroom satisfy the support threshold and are considered for HO item set mining.
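The effect of choosing the median rather than the average as the support threshold can be illustrated with a short Python sketch. The toy database and the helper names `item_supports` and `pre_prune` are assumptions made for illustration only; the values do not come from database mushroom, and expression (13a) is represented here simply by the median of the item supports.

```python
from collections import Counter
from statistics import mean, median


def item_supports(transactions):
    """Support of an item = number of transactions that contain it."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count each item once per transaction
    return dict(counts)


def pre_prune(supports, threshold):
    """Keep only the items whose support is at least the threshold."""
    return {item: s for item, s in supports.items() if s >= threshold}


# Toy database (hypothetical): item 1 is far more frequent than the others.
transactions = [[1, 2, 3], [1, 2], [1, 3], [1, 4], [1, 2, 5], [1]]

supports = item_supports(transactions)
average = mean(list(supports.values()))
delta = median(list(supports.values()))  # median-based support threshold
kept = pre_prune(supports, delta)

print("supports:", supports)
print("average:", average, "median:", delta)
print("items kept:", sorted(kept))
```

In this toy case the single very frequent item pulls the average above the median, so an average-based threshold would prune more items than the median-based one.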
The results in Figure 8b indicate that the initial analysis finds 16470 distinguished items in database retail, i.e., \(\textbf{I} = \left\{ \{0\},\ldots ,\{16469\}\right\} \) and \(\left| \textbf{I}\right| = 16470\). Database retail contains 88162 transactions, i.e., \(\left| \textbf{T}\right| = 88162\). Unlike database mushroom, the transactions in database retail differ in length: 3116 transactions contain a single item, while the longest transaction, \(\textbf{T}_{70925}\), contains 76 items (i.e., \(\left| \textbf{T}_{70925} \right| = 76\)). It is thus difficult to load database retail directly as a matrix \(\textbf{T}\). We therefore find the longest transaction and pad all other transactions with “not a number” (NaN) cells so that they have the same length \(\left| {{\mathbf{{T}}_1}} \right| = \ldots = \left| {{\mathbf{{T}}_{88162}}} \right| = 76\). Consequently, the transaction database \(\textbf{T}\) has a size of \(88162 \times 76\). MATLAB, however, recognizes all NaN values as distinguished items when the database \(\textbf{T}\) is scanned to collect the distinguished item set \(\textbf{I}\). All NaN values must therefore be trimmed from \(\textbf{I}\) so that it contains only valid items belonging to the database \(\textbf{T}\) (i.e., \(\textbf{I} = \left\{ \{0\},\ldots ,\{16469\}\right\} \)). Further investigation of the item supports shows that 2224 items have a support of only 1, while item \(\left\{ 39\right\} \) has a very high support (i.e., \( Support\left( \left\{ 39\right\} \right) = 50675\)). The average support and median support are 55.1655 and 11, respectively.
The average support is again greater than the median support because a few items with very high support inflate the average. We therefore again propose using the median support as the support threshold instead of the average support. Using expression (13a), we obtain the support threshold \(\delta = 11\) for database retail. Applying this condition in line 9 of Algorithm 1 and in line 9 of Algorithm 2, 8229 items are pre-pruned. This means that the remaining 8241 items in database retail satisfy the support threshold and are considered for HO item set mining.
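The padding and trimming steps described for database retail can be sketched in Python as follows. This is only an illustration under assumptions: the study itself works in MATLAB, the helper names `pad_transactions`, `distinct_items`, and `support_counts` are hypothetical, and the three toy transactions stand in for the 88162 transactions of database retail.

```python
import math
from collections import Counter


def pad_transactions(transactions):
    """Pad every transaction with NaN up to the length of the longest one."""
    width = max(len(t) for t in transactions)
    return [list(t) + [math.nan] * (width - len(t)) for t in transactions]


def distinct_items(matrix):
    """Collect the distinct items, trimming the NaN padding cells."""
    return sorted({x for row in matrix for x in row if not math.isnan(x)})


def support_counts(matrix):
    """Support of each valid item; NaN padding cells are ignored."""
    counts = Counter()
    for row in matrix:
        counts.update({x for x in row if not math.isnan(x)})
    return counts


# Toy transactions of different lengths (hypothetical data).
transactions = [[0, 1, 2], [1], [1, 3]]

matrix = pad_transactions(transactions)   # rectangular 3 x 3 layout
items = distinct_items(matrix)
counts = support_counts(matrix)

print("row lengths:", [len(row) for row in matrix])
print("distinct items:", items)
print("supports:", dict(counts))
```

Filtering with `math.isnan` before building the item set plays the role of the trimming step: the padding cells never enter the distinct item set or the support counts.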
About this article
Cite this article
Tran, TN., Truong Hoang, V., Truong, TC. et al. A hierarchical set-enumeration tree enabling high occupancy item set mining and the use of an adaptive occupancy threshold. Appl Intell 55, 205 (2025). https://doi.org/10.1007/s10489-024-06166-7
DOI: https://doi.org/10.1007/s10489-024-06166-7