Speeding up maximal fully-correlated itemsets search in large databases

Duan, Lian; Street, W. Nick

doi:10.1007/s13042-014-0290-9

Original Article
Published: 10 August 2014

Volume 7, pages 741–751, (2016)
International Journal of Machine Learning and Cybernetics

Lian Duan¹ &
W. Nick Street²

Abstract

Finding the most interesting correlations among items is essential for problems in many commercial, medical, and scientific domains. Our previous work on the maximal fully-correlated itemset (MFCI) framework rules out itemsets that contain irrelevant items, and its downward-closed property helps to achieve good computational performance. However, two computational issues remain when calculating the desired MFCIs in large databases. First, unlike the search for maximal frequent itemsets, which can start pruning from 1-itemsets, the search for MFCIs must start pruning from 2-itemsets. When the number of items \(n\) in a given dataset is large and the supports of all the pairs cannot be loaded into memory, the IO cost (\(O(n^2)\)) of calculating the correlation of all the pairs can be very high. Second, users usually need to try several correlation thresholds to obtain the MFCIs they want, so the cost of rerunning the Apriori procedure for each new threshold is also very high. We propose two techniques to solve these problems. First, we identify the correlation upper bound for any good correlation measure to avoid unnecessary IO queries for the support of pairs, and we exploit their common monotone property to prune many pairs without even computing their correlation upper bounds. Second, we build an enumeration tree that saves the fully-correlated value of every MFCI under a given initial correlation threshold. We can then either efficiently retrieve the desired MFCIs for any threshold above the initial threshold or incrementally grow the tree if the threshold is below the initial threshold. Experimental results show that our algorithm can be an order of magnitude faster than the original MFCI algorithm.
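The pair-pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the phi coefficient is chosen here only as one example measure, and the function names and the toy supports are hypothetical. The sketch relies on the fact that the support of a pair can never exceed the smaller of its two single-item supports, so plugging that maximum into the measure gives an upper bound that needs no pair-support lookup.

```python
from itertools import combinations
from math import sqrt


def phi(fa, fb, fab):
    """Phi coefficient of two items from their supports (fractions strictly between 0 and 1)."""
    return (fab - fa * fb) / sqrt(fa * (1 - fa) * fb * (1 - fb))


def phi_upper_bound(fa, fb):
    """Largest phi reachable from single-item supports alone, since fab <= min(fa, fb)."""
    return phi(fa, fb, min(fa, fb))


def candidate_pairs(item_support, threshold):
    """Yield only the pairs whose upper bound reaches the threshold.

    Pairs that fail this test cannot be correlated above the threshold,
    so their pair support never has to be fetched from disk.
    """
    for a, b in combinations(sorted(item_support), 2):
        if phi_upper_bound(item_support[a], item_support[b]) >= threshold:
            yield a, b


# Hypothetical single-item supports, used only for illustration.
supports = {"bread": 0.60, "milk": 0.55, "caviar": 0.02}
survivors = list(candidate_pairs(supports, threshold=0.5))
```

With these toy supports only the pair of bread and milk survives. Both pairs involving caviar are discarded without reading their joint support, because an item that rare cannot reach a phi of 0.5 with either frequent item.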

References

Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: SIGMOD ’93: Proceedings of the ACM SIGMOD international conference on management of data. ACM, New York, pp 207–216
Bayardo RJ, Jr. (1998) Efficiently mining long patterns from databases. In: SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international conference on management of data. ACM, New York, pp 85–93
Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: Generalizing association rules to correlations. In: SIGMOD ’97: Proceedings ACM SIGMOD international conference on management of data. ACM, New York, pp 265–276
Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: SIGMOD ’97: Proceedings of the ACM SIGMOD international conference on management of data. ACM, New York, pp 255–264
Burdick D (2001) Mafia: A maximal frequent itemset algorithm for transactional databases. In: ICDE ’01: Proceedings of the 17th international conference on data engineering. IEEE Computer Society, Washington, DC, p 443
Duan L, Khoshneshin M, Street W, Liu M (2013) Adverse drug effect detection. IEEE J Biomed Health Inf 17(2):305–311
Duan L, Street WN (2009) Finding maximal fully-correlated itemsets in large databases. In: ICDM ’09: Proceedings of the 9th international conference on data mining. IEEE Computer Society, Miami, pp 770–775
Duan L, Street WN, Liu Y (2013) Speeding up correlation search for binary data. Pattern Recognit Lett 34(13):1499–1507
Duan L, Street WN, Liu Y, Lu H (2014) Community detection in graphs through correlation. In: The 20th ACM SIGKDD conference on knowledge discovery and data mining (accepted)
Duan L, Street WN, Liu Y, Xu S, Wu B (2014) Selecting the right correlation measure for binary data. ACM transactions on knowledge discovery from data (accepted)
Dunning T (1993) Accurate methods for the statistics of surprise and coincidence. Comput Linguist 19(1):61–74
Geng L, Hamilton HJ (2006) Interestingness measures for data mining: A survey. ACM Comput Surv 38(3):9
Gouda K, Zaki MJ (2005) Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Min Knowl Discov 11(3):223–242
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on management of data. ACM, New York, pp 1–12
Jermaine C (2005) Finding the most interesting correlations in a database: How hard can it be? Inf Syst 30(1):21–46
Liu M, Hinz ERM, Matheny ME, Denny JC, Schildcrout JS, Miller RA, Xu H (2013) Comparative analysis of pharmacovigilance methods in the detection of adverse drug reactions using electronic medical records. J Am Med Inform Assoc 20(3):420–426
Mohamed MH, Darwieesh MM (2013) Efficient mining frequent itemsets algorithms. Int J Mach Learn Cybern. doi:10.1007/s13042-013-0172-6
Omiecinski ER (2003) Alternative interest measures for mining associations in databases. IEEE Trans Knowl Data Eng 15(1):57–69
Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. AAAI/MIT Press, Cambridge
Tan P-N, Kumar V, Srivastava J (2004) Selecting the right objective measure for association analysis. Inf Syst 29(4):293–313
Tew C, Giraud-Carrier C, Tanner K, Burton S (2014) Behavior-based clustering and analysis of interestingness measures for association rule mining. Data Min Knowl Discov 28(4):1004–1045
Vo B, Le T, Coenen F, Hong T-P (2014) Mining frequent itemsets using the n-list and subsume concepts. Int J Mach Learn Cybern. doi:10.1007/s13042-014-0252-2
Zhou W, Zhang H (2013) Correlation range query for effective recommendations. World Wide Web. doi:10.1007/s11280-013-0265-x

Author information

Authors and Affiliations

New Jersey Institute of Technology, Newark, NJ, 07102, USA
Lian Duan
The University of Iowa, Iowa City, IA, 52242, USA
W. Nick Street

Corresponding author

Correspondence to Lian Duan.

About this article

Cite this article

Duan, L., Street, W.N. Speeding up maximal fully-correlated itemsets search in large databases. Int. J. Mach. Learn. & Cyber. 7, 741–751 (2016). https://doi.org/10.1007/s13042-014-0290-9

Received: 16 January 2014
Accepted: 26 July 2014
Published: 10 August 2014
Issue Date: October 2016
DOI: https://doi.org/10.1007/s13042-014-0290-9
