StatApriori: an efficient algorithm for searching statistically significant association rules

  • Regular Paper
Knowledge and Information Systems

Abstract

Searching for statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence, so the resulting rules can be spurious while the most significant rules may be missing. This leads to erroneous models and predictions, which are often costly. The problem is computationally very difficult because significance is not a monotonic property. In this paper, however, we prove several other properties that can be used for pruning the search space. These properties are implemented in the StatApriori algorithm, which searches for statistically significant, non-redundant association rules. Empirical experiments show that StatApriori is very efficient while still finding good-quality rules.
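To make the point about spurious rules concrete, the following minimal sketch compares two hypothetical rules X → A by support, confidence, leverage, and a one-sided Fisher exact p-value. It is not the StatApriori algorithm and does not reproduce the paper's significance measure; the function name `rule_stats` and all counts are assumptions chosen purely for illustration.

```python
from math import comb


def rule_stats(n, n_x, n_a, n_xa):
    """Summarise a rule X -> A from raw counts (hypothetical helper).

    n    : number of transactions
    n_x  : transactions containing the antecedent X
    n_a  : transactions containing the consequent A
    n_xa : transactions containing both X and A
    """
    support = n_xa / n
    confidence = n_xa / n_x
    # Leverage: observed joint frequency minus the frequency expected
    # if X and A were statistically independent.
    leverage = support - (n_x / n) * (n_a / n)
    # One-sided Fisher exact test: probability of seeing at least n_xa
    # co-occurrences by chance when both margins are held fixed
    # (upper tail of the hypergeometric distribution).
    upper = min(n_x, n_a)
    tail = sum(comb(n_x, k) * comb(n - n_x, n_a - k)
               for k in range(n_xa, upper + 1))
    p_value = tail / comb(n, n_a)
    return support, confidence, leverage, p_value


# Rule 1: frequent and confident, yet X and A are independent
# (confidence 0.9 equals the base rate of A, 900 / 1000).
frequent_rule = rule_stats(n=1000, n_x=800, n_a=900, n_xa=720)

# Rule 2: rare and less confident, but far above the 5 co-occurrences
# expected under independence.
rare_rule = rule_stats(n=1000, n_x=50, n_a=100, n_xa=30)

for name, stats in (("frequent", frequent_rule), ("rare", rare_rule)):
    s, c, lev, p = stats
    print(f"{name}: support={s:.3f} confidence={c:.2f} "
          f"leverage={lev:.3f} p={p:.3g}")
```

Under a traditional support and confidence framework the first rule would be ranked above the second, although only the second expresses a genuine dependence in these made-up counts.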

Author information

Correspondence to Wilhelmiina Hämäläinen.

About this article

Cite this article

Hämäläinen, W. StatApriori: an efficient algorithm for searching statistically significant association rules. Knowl Inf Syst 23, 373–399 (2010). https://doi.org/10.1007/s10115-009-0229-8

  • DOI: https://doi.org/10.1007/s10115-009-0229-8
