A hierarchical set-enumeration tree enabling high occupancy item set mining and the use of an adaptive occupancy threshold

Tran, Thanh-Nam; Truong Hoang, Vinh; Truong, Thanh-Cong; Voznak, Miroslav

doi:10.1007/s10489-024-06166-7

Published: 23 December 2024

Volume 55, article number 205, (2025)

Applied Intelligence

Abstract

The highly efficient HEP algorithm is a useful tool for mining High Occupancy (HO) item sets. Occupancy is an important measure that describes the interestingness of frequent item sets. The current study examines the efficiency problems in mining HO item sets and proposes an improved HEP algorithm, named advanced HEP (A–HEP), based on set theory rules that eliminate a large number of redundant iterations. The study also proposes a novel adaptive-and-modified HEP (NAM–HEP) algorithm that uses HO Set-Enumeration (SE) trees to store HO item sets. To prune the HO-SE tree efficiently, the study defines adaptive thresholds, namely a support threshold and an occupancy threshold, that are derived from the attributes of the transaction database. Two pseudo-code blocks are presented in addition to a detailed description of the A–HEP and NAM–HEP algorithms and their advantages. Using the A–HEP and NAM–HEP algorithms, HO item sets are mined from two real-world transaction databases, mushroom and retail. The results indicate that the proposed A–HEP and NAM–HEP algorithms improve mining performance and runtime.

Acknowledgements

The transaction databases can be downloaded from FIMI http://fimi.uantwerpen.be/data/.

Funding

The research leading to the results published in this paper was supported by the European Union under the REFRESH project – Research Excellence For Region Sustainability and High-tech Industries, ID No. CZ.10.03.01/00/22 003/0000048 of the European Just Transition Fund, and by the Ministry of Education, Youth and Sports of the Czech Republic (MEYS CZ) under the project SGS, ID No. SP 7/2023, conducted by VSB - Technical University of Ostrava.

Author information

Authors and Affiliations

Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, No. 19 Nguyen Huu Tho, Ho Chi Minh, 70000, Viet Nam
Thanh-Nam Tran
Faculty of Information Technology, Ho Chi Minh City Open University, No. 97 Vo Van Tan Street, Ho Chi Minh, 70000, Viet Nam
Vinh Truong Hoang
Faculty of Data Science, University of Finance - Marketing, No. 778 Nguyen Kiem, Ho Chi Minh, 70000, Viet Nam
Thanh-Cong Truong
Faculty of Electrical Engineering and Computer Science, Technical University of Ostrava, 17. listopadu 2172/15, Ostrava, 70800, Czech Republic
Miroslav Voznak

Contributions

T.-N. Tran, T. H. Vinh, and T.-C. Truong contributed to conceptualization, methodology, validation, formal analysis, investigation, preparation and writing of the original draft and visualizations. M. Voznak contributed to conceptualization, writing and review, supervision, project administration, and acquisition of funds. All authors have read and agree to the published version of the manuscript.

Corresponding author

Correspondence to Vinh Truong Hoang.

Ethics declarations

Conflicts of Interest

The authors declare no known competing financial interests or personal relationships that may have influenced the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

NAM–HEP framework

Figure 7 illustrates the procedure of the proposed NAM–HEP algorithm. In step (i), the transaction database is loaded into memory. In the next step, the NAM–HEP algorithm exploits the features of the transaction database; this step performs two scans. In the first scan, the NAM–HEP algorithm searches the entire transaction database; in the second scan, it searches new transactions only. The features of the database are plotted in charts. Note that step (iii) is the most important, since a poor result at this step may lead to unsuitable HO item set mining. The support threshold for pre-pruning the power item set $\textbf{P}_1$ is determined in step (iv). The items with less support than the support threshold are pre-pruned in step (v). In step (vi), the HO of each item in the power item set is calculated from (5a). Using the HO results, the HO threshold is determined in step (vii) from (10a). In step (viii), the item sets whose occupancy is greater than or equal to the HO threshold are retained as HO item sets; the HO item sets are then stored in the HO-SE tree in step (ix). For the next level $k>1$, the HO-SE tree is scanned in step (x) to obtain the distinct HO items in step (xi), which are combined by set union with the leaf nodes at the highest level of the HO-SE tree to generate the power item set in step (xii). The NAM–HEP framework repeats from step (vi) until one of the following holds: the power item set is empty, the current HO threshold is less than the previous threshold (line 42, Algorithm 2), the set of HO item sets is empty (line 40, Algorithm 2), or level $k$ reaches the maximum level $K$ given by (8b).
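The control flow of steps (iv) to (xii) can be sketched in Python as below. This is only an illustrative sketch under stated assumptions, not the authors' Algorithm 2: the occupancy function assumes the sum of $|X|/|t|$ over the transactions $t$ that contain $X$, which may differ from (5a); the adaptive rule (10a) is not reproduced here, so the threshold is supplied as a hypothetical callable `threshold_rule`; `max_level` stands in for the limit $K$ of (8b); and a plain dictionary per level stands in for the HO-SE tree. All function and parameter names are hypothetical.

```python
def occupancy(itemset, transactions):
    """Assumed occupancy: sum of |X| / |t| over transactions t that contain X."""
    return sum(len(itemset) / len(t) for t in transactions if itemset <= t)


def mine_levelwise(transactions, support_threshold, threshold_rule, max_level):
    # Steps (iv)-(v): count item supports and pre-prune low-support items.
    support = {}
    for t in transactions:
        for item in t:
            support[item] = support.get(item, 0) + 1
    candidates = {frozenset([i]) for i, s in support.items()
                  if s >= support_threshold}

    tree = {}        # level -> {item set: occupancy}; flat stand-in for the HO-SE tree
    previous = None  # HO threshold of the previous level
    for level in range(1, max_level + 1):        # stop when the level limit is reached
        if not candidates:                       # stop: power item set is empty
            break
        occ = {c: occupancy(c, transactions) for c in candidates}  # step (vi)
        threshold = threshold_rule(list(occ.values()))             # step (vii), placeholder rule
        if previous is not None and threshold < previous:          # stop: threshold decreased
            break
        high = {c: o for c, o in occ.items() if o >= threshold}    # step (viii)
        if not high:                                               # stop: no HO item sets
            break
        tree[level] = high                                         # step (ix)
        previous = threshold
        # Steps (x)-(xii): collect the distinct HO items stored so far and
        # unite each of them with every node at the current top level.
        items = set().union(*(s for nodes in tree.values() for s in nodes))
        candidates = {c | {i} for c in high for i in items if i not in c}
    return tree


# Toy database; `min` is used purely as a placeholder threshold rule.
demo = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
tree = mine_levelwise(demo, support_threshold=3, threshold_rule=min, max_level=3)
```

In this toy run, item `c` is pre-pruned, both remaining single items are stored at level 1, and their union is stored at level 2, after which no further candidates can be generated.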

Transaction database features

Figure 8 plots the results of the analysis of the mushroom and retail transaction databases and highlights their key features.

The results show that database mushroom contains 8124 transactions (i.e., $\left| {{\textbf{T}}} \right| = 8124$), each with 23 items (i.e., $\left| {{{\textbf{T}}_1}} \right| = \ldots = \left| {{{\textbf{T}}_{8124}}} \right| = 23$). Database mushroom can therefore be loaded into the matrix $\textbf{T}$ of size $8124 \times 23$. Using (3a) to investigate item support reveals that some items have low support (e.g., the two items $\{8\}$ and $\{12\}$ have $ Support\left( \{8\}\right) = Support\left( \{12\}\right) = 4$), while others have high support (e.g., item $\{85\}$ has $ Support\left( \{85\}\right) = 8124$). This means that item $\left\{ 85\right\} $ appears in all 8124 transactions. The average support and median support are 1570.2 and 600, respectively. The average support is much greater than the median support because a few items with very high support significantly raise the average value. We therefore use the median support given by (13a) as the support threshold instead of the average support. By applying a support threshold $\delta = 600$ based on the median support, a large number of items can be pruned from the mushroom database. Applying this condition in line 9 of Algorithm 1 and in line 9 of Algorithm 2, 59 items can be pre-pruned. The remaining 60 items in database mushroom satisfy the support threshold and are considered for HO item set mining.
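The effect of choosing the median rather than the average can be illustrated with a small Python sketch. The toy transactions below are invented for illustration and are not taken from the mushroom database, and the function names are hypothetical; the sketch only assumes that an item is kept when its support is at least the threshold, as described above.

```python
from collections import Counter
from statistics import mean, median


def item_supports(transactions):
    """Count, for each item, the number of transactions that contain it."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # set() so an item counts once per transaction
    return counts


def prune_by_threshold(supports, delta):
    """Keep the items whose support is at least delta."""
    return {item for item, s in supports.items() if s >= delta}


# Invented toy database: item 1 is very frequent and skews the average.
toy = [[1, 2, 3], [1, 2], [1, 3], [1], [1, 4]]
supports = item_supports(toy)

avg_support = mean(list(supports.values()))    # pulled upwards by item 1
med_support = median(list(supports.values()))  # robust to the skew

kept_by_median = prune_by_threshold(supports, med_support)
kept_by_average = prune_by_threshold(supports, avg_support)
```

Here the supports are 5, 2, 2 and 1, so the average is 2.5 and the median is 2. A threshold at the average would keep only item 1, whereas the median threshold keeps items 1, 2 and 3.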

The results in Figure 8b indicate that the initial analysis finds 16470 distinct items in database retail, i.e., $\textbf{I} = \left\{ \{0\},\ldots ,\{16469\}\right\} $ and $\left| \textbf{I}\right| = 16470$. Database retail contains 88162 transactions, i.e., $\left| \textbf{T}\right| = 88162$. Unlike database mushroom, database retail has transactions of varying length. Specifically, database retail contains 3116 one-item transactions, while the transaction $\textbf{T}_{70925}$ contains 76 items (i.e., $\left| \textbf{T}_{70925} \right| = 76$). This means that some transactions are short and others are long. It is thus difficult to load database retail as a matrix $\textbf{T}$. In this case, we find the longest transaction (i.e., $\left| \textbf{T}_{70925} \right| = 76$) and then pad the other transactions with “not a number” (NaN) cells so that all transactions have the same length $\left| {{\mathbf{{T}}_1}} \right| = \ldots = \left| {{\mathbf{{T}}_{88162}}} \right| = 76$. Consequently, the transaction database $\textbf{T}$ has a size of $88162 \times 76$. MATLAB, however, recognizes all NaN values as distinct items when the database $\textbf{T}$ is scanned to collect the distinct item set $\textbf{I}$. All NaN values must therefore be trimmed so that $\textbf{I}$ contains only valid items (i.e., $\textbf{I} = \left\{ \{0\},\ldots ,\{16469\}\right\} $) belonging to the database $\textbf{T}$. Further investigation of the item supports shows 2224 items with a support of only 1, while item $\left\{ 39\right\} $ has a very high support (i.e., $ Support\left( \left\{ 39\right\} \right) = 50675$). This means that some items have low support while others have high support. The average support and median support are 55.1655 and 11, respectively. The average support is greater than the median support because a few items with very high support raise the average value. Again, we propose using the median support as the support threshold instead of the average support. Using expression (13a), we obtain the support threshold $\delta = 11$ for database retail. Applying this condition in line 9 of Algorithm 1 and in line 9 of Algorithm 2, 8229 items are pre-pruned. The remaining 8241 items in database retail satisfy the support threshold and are considered for HO item set mining.
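The padding and trimming described for database retail can be sketched as follows. The authors report working in MATLAB; this Python version, with plain lists in place of a matrix and hypothetical function names, only illustrates the idea on an invented three-transaction example.

```python
import math


def pad_transactions(transactions):
    """Pad every transaction with NaN up to the length of the longest one."""
    width = max(len(t) for t in transactions)
    return [list(t) + [math.nan] * (width - len(t)) for t in transactions]


def is_nan(value):
    """True only for floating-point NaN padding cells."""
    return isinstance(value, float) and math.isnan(value)


def distinct_items(matrix):
    """Collect the distinct (the text's 'distinguished') items, ignoring NaN padding."""
    return sorted({x for row in matrix for x in row if not is_nan(x)})


# Invented ragged transactions of lengths 2, 1 and 3.
ragged = [[39, 48], [39], [0, 1, 2]]
matrix = pad_transactions(ragged)   # every row now has length 3
items = distinct_items(matrix)      # NaN padding is trimmed from the item set
```

Filtering NaN while the item set is collected, rather than afterwards, avoids relying on how a particular environment compares NaN values with one another.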

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tran, TN., Truong Hoang, V., Truong, TC. et al. A hierarchical set-enumeration tree enabling high occupancy item set mining and the use of an adaptive occupancy threshold. Appl Intell 55, 205 (2025). https://doi.org/10.1007/s10489-024-06166-7

Accepted: 06 December 2024
Published: 23 December 2024
DOI: https://doi.org/10.1007/s10489-024-06166-7
