Abstract
Existing algorithms for mining association patterns often rely on the support-based pruning strategy to prune a combinatorial search space. However, this strategy is not effective for discovering potentially interesting patterns at low levels of support. Also, it tends to generate too many spurious patterns involving items which are from different support levels and are poorly correlated. In this paper, we present a framework for mining highly-correlated association patterns called hyperclique patterns. In this framework, an objective measure called h-confidence is applied to discover hyperclique patterns. We prove that the items in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another as measured by the cosine similarity (uncentered Pearson's correlation coefficient). Also, we show that the h-confidence measure satisfies a cross-support property which can help efficiently eliminate spurious patterns involving items with substantially different support levels. Indeed, this cross-support property is not limited to h-confidence and can be generalized to some other association measures. In addition, an algorithm called hyperclique miner is proposed to exploit both cross-support and anti-monotone properties of the h-confidence measure for the efficient discovery of hyperclique patterns. Finally, our experimental results show that hyperclique miner can efficiently identify hyperclique patterns, even at extremely low levels of support.










Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
It is available at http://www.almaden.ibm.com/ software/quest/resources.
This is observed on Sun Ultra 10 work station with a 440 MHz CPU and 128 Mbytes of memory.
When computing Pearson's correlation coefficient, the data mean is not subtracted.
Note that the number of items shown in Table 4 for pumsb, pumsb * are somewhat different from the numbers reported in Zaki and Hsiao (2002), because we only consider item IDs for which the count is at least one. For example, although the minimum item ID in pumsb is 0 and the maximum item ID is 7116, there are only 2113 distinct item IDs that appear in the data set.
The data set is available at http://trec.nist.gov.
References
Agarwal R, Aggarwal C, Prasad V (2000) A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing
Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. pp 207–216
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases
Bayardo R, Agrawal R, Gunopulous D (1999) Constraint-based rule mining in large, dense databases. In: Proceedings of the Int'l Conference on Data Engineering
Bayardo RJ (1998) Efficiently mining long patterns from databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data
Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: Generalizing association rules to correlations. In: Proceedings of ACM SIGMOD Int'l Conference on Management of Data. pp 265–276
Burdick D, Calimlim M, Gehrke J (2001) MAFIA: A maximal frequent itemset algorithm for transactional databases. In: Proceedings of the 2001 Int'l Conference on Data Engineering (ICDE)
Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman J, Yang C (2000) Finding interesting associations without support pruning. In: Proceedings of the 2000 Int'l Conference on Data Engineering (ICDE). pp 489–499
Feder T, Motwani R (1995) Clique partitions, graph compression and speeding-up algorithms. Special Issue for the STOC conference. Journal of Computer and System Sciences 51:261–272
Grahne G, Lakshmanan LVS, Wang X (2000) Efficient mining of constrained correlated sets. In: Proceedings of the Int'l Conference on Data Engineering
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of ACM SIGMOD Int'l Conference on Management of Data
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: Data mining, inference, and prediction, Springer
Liu B, Hsu W, Ma Y (1999) Mining association rules with multiple minimum supports. In: Proceedings of the 1999 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Omiecinski E (2003) Alternative interest measures for mining associations. IEEE Transactions on Knowledge and Data Engineering 15(1)
Pei J, Han J, Mao R (2000) CLOSET: An efficient algorithm for mining frequent closed itemsets. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery
Reynolds HT (1977) The analysis of cross-classifications. The Free Press
Rijsbergen CJV (1979) Information retrieval, 2nd edn., Butterworths, London
Tan P, Kumar V, Srivastava J (2002) Selecting the right interestingness measure for association patterns. In: Proceedings of the 1999 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Wang K, He Y, Cheung D, Chin Y (2001) Mining confident rules without support requirement. In: Proceedings of the 2001 ACM International Conference on Information and Knowledge Management (CIKM)
Xiong H, He X, Ding C, Zhang Y, Kumar V, Holbrook S (2005) Identification of functional modules in protein complexes via hyperclique pattern discovery. In: Proceedings of the Pacific Symposium on Biocomputing (PSB)
Xiong H, Steinbach M, Tan P-N, Kumpar V (2004) HICAP: Hierarchial clustering with pattern preservation. In: Proceedings of 2004 SIAM Int'l Conference on Data Mining (SDM). pp 279–290
Xiong H, Tan P, Kumar V (2003a) Mining hyperclique patterns with confidence pruning. In: Technical Report 03-006, January, Department of computer science, University of Minnesota, Twin Cities
Xiong H, Tan P, Kumar V (2003b) Mining strong affinity association patterns in data sets with skewed support distribution. In: Proceedings of the 3rd IEEE International Conference on Data Mining. pp 387–394
Yang C, Fayyad UM, Bradley PS (2001) Efficient discovery of error-tolerant frequent itemsets in high dimensions. In: Proceedings of the 1999 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Zaki M, Hsiao C-J (2002) CHARM: An efficient algorithm for closed itemset mining. In: Proceedings of 2002 SIAM International Conference on Data Mining
Zipf G (1949) Human behavior and principle of least effort: An introudction to human ecology. Addison Wesley, Cambridge, Massachusetts
Acknowledgments
This work was partially supported by NSF grant # IIS-0308264, DOE/LLNL W-7045-ENG-48, and by Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAD19-01-2-0014. The content of this work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute. Finally, we would like to thank Dr. Mohammed J. Zaki for providing us the CHARM code. Also, we would like to thank Dr. Shashi Shekhar, Dr. Ke Wang, and Michael Steinbach for valuable comments.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Xiong, H., Tan, PN. & Kumar, V. Hyperclique pattern discovery. Data Min Knowl Disc 13, 219–242 (2006). https://doi.org/10.1007/s10618-006-0043-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-006-0043-9