Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures

Hämäläinen, Wilhelmiina

doi:10.1007/s10115-011-0432-2

Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures

Regular Paper
Published: 28 July 2011

Volume 32, pages 383–414, (2012)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Wilhelmiina Hämäläinen¹

436 Accesses
35 Citations
Explore all metrics

Abstract

Statistical dependency analysis is the basis of all empirical science. A commonly occurring problem is to find the most significant dependency rules, which describe either positive or negative dependencies between categorical attributes. In medical science, for example, one is interested in genetic factors, which can either predispose or prevent diseases. The requirement of statistical significance is essential, because the discoveries should hold also in future data. Typically, the significance is estimated either by Fisher’s exact test or the χ ²-measure. The problem is computationally very difficult, because the number of all possible dependency rules increases exponentially with the number of attributes. As a solution, different kinds of restrictions and heuristics have been applied, but a general, scalable search method has been missing. In this paper, we introduce an efficient algorithm, called Kingfisher, for searching for the best non-redundant dependency rules with statistical significance measures. The rules can express either positive or negative dependencies between a set of positive attributes and a single consequent attribute. The algorithm itself is independent from the used goodness measure, but we concentrate on Fisher’s exact test and the χ ²-measure. The algorithm is based on an application of the branch-and-bound search strategy, supplemented by several pruning properties. Especially, we prove a new lower bound for Fisher’s p and introduce a new effective pruning principle. According to our experiments on classical benchmark data, the algorithm is well scalable and can efficiently handle even dense and high-dimensional data sets. An interesting observation was that Fisher’s exact test did not only produce more reliable rules than the χ ²-measure, but it also performed the search much faster.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Discovering dependencies with reliable mutual information

Article Open access 24 July 2020

The randomized information coefficient: assessing dependencies in noisy data

Article 19 September 2017

Dependency relations

Article 02 January 2019

References

Aggarwal C, Yu P (1998) A new framework for itemset generation. In: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (PODS 1998). ACM Press, New York, pp 18–24
Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data. ACM Press, New York, pp 207–216
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB’94). Morgan Kaufmann, Los Altos, pp 487–499
Agresti A (1992) A survey of exact inference for contingency tables. Stat Sci 7(1): 131–153
Article MathSciNet MATH Google Scholar
Antonie M-L, Zaïane OR (2004) Mining positive and negative association rules: an approach for confined rules. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD’04). Springer, Berlin, pp 27–38
Blanchard J, Guillet F, Gras R, Briand H (2005) Using information-theoretic measures to assess association rule interestingness. In: Proceedings of the Fifth IEEE international conference on data mining (ICDM’05). IEEE Comput Soc, pp 66–73
Borgelt C (2010) Apriori v5.14 software. http://www.borgelt.net/apriori.html. Retrieved 7.6. 2010
Cormen T, Leiserson C, Rivest R (1990) Introduction to algorithms. The MIT Press, Cambridge
MATH Google Scholar
Fisher R (1925) Statistical methods for research workers. Oliver and Boyd, Edinburgh
Google Scholar
Hämäläinen W (2009) Lift-based search for significant dependencies in dense data sets. In: Proceedings of the workshop on statistical and relational learning in bioinformatics (StReBio’09), in the 15th ACM SIGKDD conference on knowledge discovery and data mining (KDD’09). ACM Press, New York, pp 12–16
Hämäläinen W (2010a) Efficient discovery of the top-K optimal dependency rules with Fisher’s exact test of significance. In: Proceedings of the 10th IEEE international conference on data mining (ICDM 2010). IEEE Computer Society, Wahington, pp 196–205
Hämäläinen W (2010b) Efficient search for statistically significant dependency rules in binary data. PhD thesis, Department of Computer Science, University of Helsinki, Finland. Series of Publications A, Report A-2010-2
Hämäläinen W (2010) Statapriori: an efficient algorithm for searching statistically significant association rules. Knowl Inf Syst Int J (KAIS) 23(3): 373–399
Article Google Scholar
Hämäläinen W, Nykänen M (2008) Efficient discovery of statistically significant association rules. In: Proceedings of the 8th IEEE international conference on data mining (ICDM’08), pp 203–212
Koh Y, Pears R (2007) Efficiently finding negative association rules without support threshold. In: Advances in artificial intelligence, proceedings of the 20th Australian joint conference on artificial intelligence (AI 2007), vol 4830 of lecture notes in computer cience. Springer, Berlin, pp 710–714
Koh Y, Rountree N, O’Keefe R (2008) Mining interesting imperfectly sporadic rules. Knowl Inf Syst 14(2): 179–196
Article Google Scholar
Li J (2006) On optimal rule discovery. IEEE Trans Knowl Data Eng 18(4): 460–471
Article Google Scholar
Liu B, Hsu W, Ma Y (1999) Pruning and summarizing the discovered associations. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (KDD’99). ACM Press, New York, pp 125–134
Morishita S, Sese J (2000) Transversing itemset lattices with statistical metric pruning. In: Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS’00). ACM Press, New York, pp 226–236
Nijssen S, Guns T, Raedt LD (2009) Correlated itemset mining in ROC space: a constraint programming approach. In: Proceedings the 15th ACM SIGKDD conference on knowledge discovery and data mining (KDD’09). ACM Press, New York, pp 647–656
Nijssen S, Kok J (2006) Multi-class correlated pattern mining. In: Proceedings of the 4th international workshop on knowledge discovery in inductive databases, vol 3933 of lecture notes in computer science. Springer, Berlin, pp 165–187
Thiruvady D, Webb G (2004) Mining negative rules using GRD. In: Advances in knowledge discovery and data mining, proceedings of the 8th Pacific-Asia conference (PAKDD 2004), vol 3056 of lecture notes in computer science. Springer, Berlin, pp 161–165
Webb G (2006) Discovering significant rules. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06). ACM Press, New York, pp 434–443
Webb G (2007) Discovering significant patterns. Mach Learn 68(1): 1–33
Article Google Scholar
Webb G (2008) Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Mach Learn 71(2–3): 307–323
Article Google Scholar
Webb G (n.d.) MagnumOpus software. http://www.giwebb.com/index.html. Retrieved 10.2. 2009
Webb G, Zhang S (2005) K-optimal rule discovery. Data Mining Knowl Discov 10(1): 39–79
Article MathSciNet Google Scholar
Wu X, Zhang C, Zhang S (2002) Mining both positive and negative association rules. In: Proceedings of the nineteenth international conference on machine learning (ICML ’02). Morgan Kaufmann Publishers Inc., San Francisco, pp 658–665
Wu X, Zhang C, Zhang S (2004) Efficient mining of both positive and negative association rules. ACM Trans Inf Syst 22(3): 381–405
Article Google Scholar
Xiong H, Shekhar S, Tan P-N, Kumar V (2004) Exploiting a support-based upper bound of Pearson’s correlation coefficient for efficiently identifying strongly correlated pairs. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04). ACM Press, New York, pp 334–343
Yates F (1984) Test of significance for 2 × 2 contingency tables. J Roy Stat Soc Ser A (General) 147(3): 426–463
Article MathSciNet MATH Google Scholar
Zhang S, Wu X (2011) Fundamentals of association rules in data mining and knowledge discovery. Wiley Interdiscip Rev: Data Mining Knowl Discov 1(2): 97–116
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Biosciences, University of Eastern Finland, Joensuu, Finland
Wilhelmiina Hämäläinen

Authors

Wilhelmiina Hämäläinen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wilhelmiina Hämäläinen.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hämäläinen, W. Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowl Inf Syst 32, 383–414 (2012). https://doi.org/10.1007/s10115-011-0432-2

Download citation

Received: 28 January 2011
Revised: 03 May 2011
Accepted: 25 June 2011
Published: 28 July 2011
Issue Date: August 2012
DOI: https://doi.org/10.1007/s10115-011-0432-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures

Abstract

Access this article

Similar content being viewed by others

Discovering dependencies with reliable mutual information

The randomized information coefficient: assessing dependencies in noisy data

Dependency relations

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures

Abstract

Access this article

Similar content being viewed by others

Discovering dependencies with reliable mutual information

The randomized information coefficient: assessing dependencies in noisy data

Dependency relations

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation