StatApriori: an efficient algorithm for searching statistically significant association rules

Hämäläinen, Wilhelmiina

doi:10.1007/s10115-009-0229-8

Regular Paper
Published: 21 July 2009

Volume 23, pages 373–399, (2010)

Knowledge and Information Systems

Wilhelmiina Hämäläinen¹

Abstract

Searching for statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence, so the resulting rules can be spurious while the most significant rules may be missing. This leads to erroneous models and predictions, which often prove expensive. The problem is computationally very difficult because significance is not a monotonic property. In this paper, however, we prove several other properties that can be used for pruning the search space. These properties are implemented in the StatApriori algorithm, which searches for statistically significant, non-redundant association rules. Empirical experiments have shown that StatApriori is very efficient while at the same time finding good-quality rules.
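
The two claims above (frequent, confident rules can be spurious, and significance is not monotonic) can be illustrated with a small sketch. It is not the StatApriori algorithm: the abstract does not say which significance measure the paper uses, so the sketch assumes a Pearson chi-square statistic on the 2 × 2 contingency table of a rule X -> A, and all counts and function names are invented for illustration.

```python
# Illustrative sketch only: the counts are invented and the chi-square
# statistic is an assumed stand-in for a significance measure.

def confidence(f_x, f_xa):
    """Confidence of X -> A: fraction of rows with X that also contain A."""
    return f_xa / f_x

def lift(n, f_x, f_a, f_xa):
    """Observed frequency of XA divided by its expectation under independence."""
    return (n * f_xa) / (f_x * f_a)

def chi_square(n, f_x, f_a, f_xa):
    """Pearson chi-square (1 degree of freedom) for the 2x2 table of X and A.

    Two-sided: it measures the strength of dependence, not its direction.
    """
    numerator = n * (n * f_xa - f_x * f_a) ** 2
    denominator = f_x * (n - f_x) * f_a * (n - f_a)
    return numerator / denominator

N = 1000            # number of rows (hypothetical data set)
CRITICAL = 3.84     # chi-square critical value, 1 d.o.f., 5 % level

# Each tuple is (f_x, f_a, f_xa): absolute frequencies of X, A and XA.
rules = {
    "frequent but independent": (800, 900, 720),
    "rare but dependent": (50, 60, 45),
    "general rule Y -> B": (500, 500, 250),
    "specialised rule YZ -> B": (100, 500, 100),
}

for name, (f_x, f_a, f_xa) in rules.items():
    chi2 = chi_square(N, f_x, f_a, f_xa)
    print(
        f"{name}: support={f_xa / N:.3f} "
        f"confidence={confidence(f_x, f_xa):.2f} "
        f"lift={lift(N, f_x, f_a, f_xa):.2f} "
        f"chi2={chi2:.1f} significant={chi2 > CRITICAL}"
    )
```

The first two rules have the same confidence, yet only the rare one expresses a dependence; a minimum-support threshold would keep the spurious rule and discard the significant one. The last two rules show the non-monotonicity: the general rule Y -> B is not significant while its specialisation YZ -> B is, so a non-significant rule cannot simply be used to prune its specialisations.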

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Finland, Europe
Wilhelmiina Hämäläinen

Corresponding author

Correspondence to Wilhelmiina Hämäläinen.

About this article

Cite this article

Hämäläinen, W. StatApriori: an efficient algorithm for searching statistically significant association rules. Knowl Inf Syst 23, 373–399 (2010). https://doi.org/10.1007/s10115-009-0229-8

Received: 16 January 2009
Revised: 21 April 2009
Accepted: 09 May 2009
Published: 21 July 2009
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10115-009-0229-8
