Information Sciences

Volume 173, Issue 4, 23 June 2005, Pages 305-318

Adapting the CBA algorithm by means of intensity of implication

https://doi.org/10.1016/j.ins.2004.03.022

Abstract

In recent years, extensive research has been carried out on using association rules to build more accurate classifiers. The idea behind these integrated approaches is to focus on a limited subset of association rules. This paper aims to contribute to this integrated framework by adapting the Classification Based on Associations (CBA) algorithm. CBA was adapted by coupling it with another measure of the quality of association rules: intensity of implication. The new algorithm has been implemented and empirically tested on a real-life financial dataset for bankruptcy prediction. We validated our results against an association rule set, C4.5, the original CBA algorithm and CART by statistically comparing performance via the area under the ROC curve. The adapted CBA algorithm presented in this paper generated significantly better results than the other classifiers at the 5% level of significance.

Introduction

Classification and association-rule discovery are two of the most important tasks addressed in the data mining literature. Association rules have received significant attention for extracting knowledge from large databases. These methods use an exhaustive search to find all rules in the data that satisfy user-specified minimum support and minimum confidence criteria. The Apriori algorithm is the best-known algorithm in this field [2]. An even more popular technique is probably classification rule mining, which aims to discover a small set of rules that form an accurate classifier. Given a set of cases with class labels as a training set, the aim of classification is to build a model (called a classifier) to predict future data objects for which the class label is unknown. Many systems have been proposed for classification rule mining, but Quinlan's C4.5 classifier [17] is widely regarded as the state of the art.
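
To make the support and confidence criteria concrete, the following toy sketch (hypothetical transactions and thresholds, not taken from the paper) enumerates candidate rules X -> Y and keeps only those that meet the user-specified minimum support and minimum confidence; Apriori [2] obtains the same rule set far more efficiently through level-wise pruning.

from itertools import combinations

# Hypothetical toy transactions; thresholds are illustrative only.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}, {"a", "b", "c"}]
min_support, min_confidence = 0.4, 0.6

def support(itemset):
    # Fraction of transactions containing every item of `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
rules = []
for size in range(2, len(items) + 1):
    for itemset in map(frozenset, combinations(items, size)):
        sup = support(itemset)
        if sup < min_support:          # discard infrequent itemsets
            continue
        for consequent in itemset:     # single-item consequents, for brevity
            antecedent = itemset - {consequent}
            conf = sup / support(antecedent)   # confidence = P(Y | X)
            if conf >= min_confidence:
                rules.append((antecedent, consequent, sup, conf))

for x, y, sup, conf in rules:
    print(f"{set(x)} -> {y}  support={sup:.2f}  confidence={conf:.2f}")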

In recent years, extensive research has been carried out to integrate both approaches. By focusing on a limited subset of association rules, i.e. those rules whose consequent is restricted to the class attribute, it is possible to build more accurate classifiers. Several publications [5], [7], [12], [14], [20] have shown that association-based classification generally achieves accuracy at least equal to that of state-of-the-art classification algorithms such as C4.5. The reasons for this good performance are clear. Association rule mining searches globally for all rules that satisfy the minimum support and minimum confidence thresholds, so the result contains the full set of rules, which may incorporate important information. The richness of these rules gives the technique the potential to reflect the true classification structure in the data [20]. Associative classification is therefore gaining popularity. However, the comprehensiveness and complexity of dealing with the often large number of association rules have led to difficulties and (accuracy versus generality) trade-off questions that are the subject of ongoing research. Contributions tackling a number of these difficulties can be found in [7], [13] and [20]. Liu et al. proposed an improvement of their original Classification Based on Associations (CBA) system [14] in [15] to cope with weaknesses in the system. Although these adaptations of CBA are valuable, some important issues remain unresolved; our goal is to address them in this paper.

The potential weakness we identified lies in the way CBA sorts its (class) association rules. As will be explained in Section 2, this sorting is important because the rules for the final classifier are selected by following the sorted sequence. CBA sorts its rules by conditional probability (confidence). Confidence is a good measure when classes are equally distributed; however, as we will show, when class distributions differ significantly, and especially for low-frequency classes, it is not the most adequate criterion. For this reason, we propose intensity of implication [8] as a better measure for sorting the class association rules. In addition, our implementation of the CBA algorithm traces the evolution of the number of false positives (FP) and false negatives (FN) separately, rather than only the total number of errors as the original CBA algorithm does. Section 3 elaborates on both issues. The results of our empirical evaluation are given in Section 4. Finally, conclusions and recommendations for further research are presented in Section 5.
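
As a rough illustration of the proposed ranking criterion, the following sketch computes both confidence and intensity of implication for hypothetical rule counts. It uses the Poisson formulation commonly associated with Gras's measure [8], i.e. the probability that a random, independent labelling would produce more counterexamples than were actually observed; the exact formulation, approximations and tie-breaking used in our implementation may differ from this sketch. Note how two rules with identical confidence can receive very different intensities.

from math import exp

def poisson_cdf(k, lam):
    # P(N <= k) for N ~ Poisson(lam), by direct summation (a normal
    # approximation is usually preferred for very large counts).
    term, total = exp(-lam), exp(-lam)
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def intensity_of_implication(n, n_x, n_y, n_xy):
    # Rule X -> Y over n examples: n_x covered by X, n_y in class Y,
    # n_xy covered by X and belonging to Y.
    counterexamples = n_x - n_xy          # |X and not Y|
    lam = n_x * (n - n_y) / n             # expected counterexamples under independence
    # Probability that a random labelling would yield MORE counterexamples.
    return 1.0 - poisson_cdf(counterexamples, lam)

def confidence(n, n_x, n_y, n_xy):
    return n_xy / n_x

# Hypothetical imbalanced setting: the target class covers 5% of 10,000 cases.
n, n_y = 10_000, 500
rules = [("r1: frequent antecedent", 400, 60),   # (name, n_x, n_xy)
         ("r2: rare antecedent", 20, 3)]
for name, n_x, n_xy in rules:
    print(name,
          f"confidence={confidence(n, n_x, n_y, n_xy):.2f}",
          f"intensity={intensity_of_implication(n, n_x, n_y, n_xy):.3f}")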

Section snippets

Classification based on associations

Before elaborating on the changes made to CBA, we provide a comprehensive overview of the original algorithm. We first define association rules and then introduce class association rules (CARs).
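
The following minimal sketch (with illustrative feature values and counts, not taken from the paper's dataset) shows the idea behind CARs, i.e. rules whose consequent is restricted to a class label, together with the generic CBA-style strategy of sorting the rules and classifying a case with the first rule that covers it.

from typing import FrozenSet, List, Tuple

# A CAR is (antecedent items, class label, support, confidence).
CAR = Tuple[FrozenSet[str], str, float, float]

cars: List[CAR] = [
    (frozenset({"income=low", "debt=high"}), "bankrupt", 0.04, 0.90),
    (frozenset({"debt=high"}), "bankrupt", 0.06, 0.70),
    (frozenset({"income=high"}), "healthy", 0.40, 0.95),
]

# CBA-style ordering: higher confidence first, ties broken by higher support
# (the original algorithm adds a further tie-break on rule generation order).
cars.sort(key=lambda r: (r[3], r[2]), reverse=True)

def classify(case: FrozenSet[str], default: str = "healthy") -> str:
    # Predict with the first (highest-ranked) rule whose antecedent covers the case.
    for antecedent, label, _sup, _conf in cars:
        if antecedent <= case:
            return label
    return default

print(classify(frozenset({"income=low", "debt=high", "age=34"})))  # -> bankrupt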

Limits of conditional probability (confidence)

A closer examination of the algorithm identified a potential weakness in the way the rules are sorted. Since rules are inserted into the classifier in sorted confidence order, this ordering determines to a large extent the accuracy of the final classifier. Confidence is a good measure of the quality of (class) association rules, but it also suffers from certain weaknesses, on which this section elaborates.

The first weakness is that the conditional probability of a rule X → Y

Description of the data

The training data used in this study come from a satisfaction survey conducted among customers of a major bank in Belgium in 1996. Nationwide, 7264 customers of the bank filled out a questionnaire. The questionnaire includes questions probing the level of satisfaction with specific service aspects of the bank, questions on socio-demographic characteristics of the customers, and a question probing the overall level of satisfaction. Customers were asked to

Conclusion

The algorithm presented in this paper is a modified version of the CBA algorithm, which can be used to build classifiers based on association rules. CBA was adapted by coupling it with intensity of implication, a measure that quantifies the distance to random choice of small, even statistically non-significant, subsets. The evolution of the number of FP and FN was also traced separately. As a result, the implementation was more transparent, since the evolution of both types of errors could be

References (20)

  • A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition (1997)
  • R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proc. of...
  • R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. of the 20th International Conference on...
  • T. Brijs, G. Swinnen, K. Vanhoof, G. Wets, Comparing complete and partial classification for identifying latently...
  • J. Chen, H.Y. Liu, G.Q. Chen, Mining insightful classification rules directly and efficiently, in: IEEE/SMC’99, Vol. 3,...
  • E.R. DeLong et al., Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach, Biometrics (1988)
  • G. Dong, X. Zhang, L. Wong, J. Li, CAEP: Classification by aggregating emerging patterns, in: Proc. of the Second...
  • R. Gras et al., L’implication statistique: une nouvelle méthode d’analyse des données, Mathématiques, Informatique et Sciences Humaines (1993)
  • S. Guillaume et al., Improving the discovery of association rules with intensity of implication, in: Principles of Data Mining and Knowledge Discovery, Lecture Notes in Artificial Intelligence (1998)
  • J.A. Hanley et al., A method of comparing the areas under receiver operating characteristic curves derived from the same cases, Radiology (1983)

Cited by (13)

  • The best of two worlds: Balancing model strength and comprehensibility in business failure prediction using spline-rule ensembles

    2017, Expert Systems with Applications
    Citation Excerpt:

    By far, the majority of methodological contributions in the business failure prediction literature has focused upon methods originating from the data mining and machine learning literature. In this category one can cite artificial neural networks (Atiya, 2001; Pendharkar, 2005), decision trees (Frydman, Altman, & Kao, 1985), support vector machines (Li & Sun, 2011a), Bayesian networks (Sun & Shenoy, 2007), rough sets (McKee, 2003), k-nearest neighbors (Park & Han, 2002), association rules (Janssens, Wets, Brijs, & Vanhoof, 2005) and finally, ensemble learners (Li & Sun, 2011b). A comprehensive review of statistical and data mining techniques used for business failure prediction can be found in Ravi Kumar and Ravi (2007).

  • Increasing the effectiveness of associative classification in terms of class imbalance by using a novel pruning algorithm

    2012, Expert Systems with Applications
    Citation Excerpt:

    According to the average, UCONF appears to perform the best at 0.9363, while CONF performs the worst at 0.9313. Therefore, in contrast with the findings of Janssens et al. (2005), this work adopts a more stringent method, which demonstrates that the prediction model constructed from the ranking index IOI performs better than CONF. Table 4 summarizes the AUC test results of DeLong et al. (1988) for the three ranking indices.

  • Adjusting and generalizing CBA algorithm to handling class imbalance

    2012, Expert Systems with Applications
    Citation Excerpt:

    However, this algorithm prunes rules by pessimistic error pruning (PEP) (Quinlan, 1992), and so many overlapping data cases that satisfy rules, raising the possibility of misevaluation in the scoring of a test set. Janssens et al. (2005) studied class imbalance classification and used a rule sorting index, “Intensity of implication”, to mitigate the problem that the positive class rules are eliminated from a classifier. Chen, Hsu, and Hsu (2010) proved that the sorting index slightly improves the rank of the positive class rules.

  • Principal component case-based reasoning ensemble for business failure prediction

    2011, Information and Management
    Citation Excerpt:

    Business failure prediction (BFP) techniques coded in these systems should obviously be accurate in order to avoid significant bank losses. Commonly used models today include multivariate discriminant analysis (MDA) [5], logistic regression (Logit) [6,9], neural network (NN) [18], case-based reasoning (CBR) [12], rough sets theory [1,14], Bayesian network [23], data envelopment analysis [2,17,19], association rules [8], and support vector machine (SVM) [4,7]. These models all predict business failure using a single predictive model.
