Abstract
Searching for statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence, so the resulting rules can be spurious while the most significant rules may be missing. This leads to erroneous models and predictions, which can be costly. The problem is computationally very difficult, because statistical significance is not a monotonic property. However, in this paper, we prove several other properties that can be used for pruning the search space. The properties are implemented in the StatApriori algorithm, which searches for statistically significant, non-redundant association rules. Empirical experiments have shown that StatApriori is very efficient and at the same time finds good-quality rules.
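To illustrate why a frequent, high-confidence rule can still be spurious, the following minimal sketch compares two hypothetical rules on made-up counts. It is not the StatApriori algorithm or its significance measure; the function names, the counts, and the choice of a one-sided Fisher exact test are assumptions made only for this illustration.

```python
from math import comb


def confidence(n_x, n_xa):
    """Confidence of X -> A: fraction of rows with X that also contain A."""
    return n_xa / n_x


def lift(n, n_x, n_a, n_xa):
    """Lift of X -> A: P(X, A) / (P(X) * P(A)); 1.0 means independence."""
    return (n_xa * n) / (n_x * n_a)


def fisher_p(n, n_x, n_a, n_xa):
    """One-sided Fisher exact p-value (illustrative choice of test).

    Probability of seeing at least n_xa co-occurrences of X and A among
    n rows if X and A were independent, with the margins n_x, n_a fixed.
    """
    total = comb(n, n_x)
    upper = min(n_x, n_a)
    tail = sum(
        comb(n_a, k) * comb(n - n_a, n_x - k)
        for k in range(n_xa, upper + 1)
    )
    return tail / total


# Hypothetical counts over n = 1000 rows (not taken from the paper).
# Rule 1: frequent and confident, but A occurs in 90% of all rows anyway.
rule1 = dict(n=1000, n_x=500, n_a=900, n_xa=450)
# Rule 2: rarer and less confident, but far above what independence predicts.
rule2 = dict(n=1000, n_x=100, n_a=100, n_xa=50)

for name, r in (("rule 1", rule1), ("rule 2", rule2)):
    print(
        name,
        "confidence =", confidence(r["n_x"], r["n_xa"]),
        "lift =", lift(**r),
        "p =", fisher_p(**r),
    )
```

With these counts, rule 1 has the higher support and confidence yet shows no dependence at all (lift 1), whereas rule 2 has lower support and confidence but a very small p-value. A search driven only by support and confidence thresholds would rank the two rules the wrong way round.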
Cite this article
Hämäläinen, W. StatApriori: an efficient algorithm for searching statistically significant association rules. Knowl Inf Syst 23, 373–399 (2010). https://doi.org/10.1007/s10115-009-0229-8
DOI: https://doi.org/10.1007/s10115-009-0229-8