Information Sciences

Volume 173, Issue 4, 23 June 2005, Pages 305-318

Adapting the CBA algorithm by means of intensity of implication

https://doi.org/10.1016/j.ins.2004.03.022

Abstract

In recent years, extensive research has been carried out on using association rules to build more accurate classifiers. The idea behind these integrated approaches is to focus on a limited subset of association rules. This paper aims to contribute to this integrated framework by adapting the Classification Based on Associations (CBA) algorithm. CBA was adapted by coupling it with another measure of the quality of association rules: intensity of implication. The new algorithm has been implemented and empirically tested on a real-life financial dataset for bankruptcy prediction. We validated our results against an association rule set, C4.5, the original CBA algorithm and CART by statistically comparing performance via the area under the ROC curve. The adapted CBA algorithm presented in this paper generated significantly better results than the other classifiers at the 5% level of significance.

Introduction

Classification and association-rule discovery are two of the most important tasks addressed in the data mining literature. Association rules have received significant attention for extracting knowledge from large databases. These methods use an exhaustive search to find all rules in the data that satisfy user-specified minimum support and minimum confidence criteria. The Apriori algorithm is the best-known algorithm in this field [2]. An even more popular technique is probably classification rule mining, which aims to discover a small set of rules that form an accurate classifier. Given a set of cases with class labels as a training set, the aim of classification is to build a model (called a classifier) to predict future data objects for which the class label is unknown. Many systems have been proposed for classification rule mining, but Quinlan's C4.5 classifier [17] is widely regarded as the state of the art.
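
To make the support and confidence criteria concrete, the following toy sketch (hypothetical transactions and thresholds, not taken from the paper) enumerates candidate rules X -> Y and keeps only those that meet the user-specified minimum support and minimum confidence; Apriori [2] obtains the same rule set far more efficiently through level-wise pruning.

from itertools import combinations

# Hypothetical toy transactions; thresholds are illustrative only.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}, {"a", "b", "c"}]
min_support, min_confidence = 0.4, 0.6

def support(itemset):
    # Fraction of transactions containing every item of `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
rules = []
for size in range(2, len(items) + 1):
    for itemset in map(frozenset, combinations(items, size)):
        sup = support(itemset)
        if sup < min_support:          # discard infrequent itemsets
            continue
        for consequent in itemset:     # single-item consequents, for brevity
            antecedent = itemset - {consequent}
            conf = sup / support(antecedent)   # confidence = P(Y | X)
            if conf >= min_confidence:
                rules.append((antecedent, consequent, sup, conf))

for x, y, sup, conf in rules:
    print(f"{set(x)} -> {y}  support={sup:.2f}  confidence={conf:.2f}")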

In recent years, extensive research has been carried out to integrate both approaches. By focusing on a limited subset of association rules, i.e. those rules whose consequent is restricted to the class attribute, it is possible to build more accurate classifiers. Several publications [5], [7], [12], [14], [20] have shown that association-based classification generally achieves accuracy at least equal to that of state-of-the-art classification algorithms such as C4.5. The reasons for this good performance are clear. Association rule mining searches globally for all rules that satisfy the minimum support and minimum confidence thresholds, so the result contains the full set of rules, which may incorporate important information. The richness of these rules gives the technique the potential to reflect the true classification structure in the data [20]. Associative classification is therefore gaining popularity. However, the comprehensiveness and complexity of dealing with the often large number of association rules have led to difficulties and (accuracy versus generality) trade-off questions that are the subject of ongoing research. Contributions tackling a number of these difficulties can be found in [7], [13] and [20]. Liu et al. proposed an improvement of their original Classification Based on Associations (CBA) system [14] in [15] to cope with weaknesses in the system. Although these adaptations of CBA are valuable, some important issues remain unresolved; our goal is to address them in this paper.

The potential weakness we identified lies in the way CBA sorts its (class) association rules. As will be explained in Section 2, this sorting is important because the rules for the final classifier are selected by following the sorted sequence. CBA sorts its rules by conditional probability (confidence). Confidence is a good measure when classes are equally distributed; however, as we will show, when class distributions differ significantly, and especially for low-frequency classes, it is not the most adequate criterion. For this reason, we propose intensity of implication [8] as a better measure for sorting the class association rules. In addition, our implementation of the CBA algorithm traces the evolution of the number of false positives (FP) and false negatives (FN) separately, rather than only the total number of errors as the original CBA algorithm does. Section 3 elaborates on both issues. The results of our empirical evaluation are given in Section 4. Finally, conclusions and recommendations for further research are presented in Section 5.
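
As a rough illustration of the proposed ranking criterion, the following sketch computes both confidence and intensity of implication for hypothetical rule counts. It uses the Poisson formulation commonly associated with Gras's measure [8], i.e. the probability that a random, independent labelling would produce more counterexamples than were actually observed; the exact formulation, approximations and tie-breaking used in our implementation may differ from this sketch. Note how two rules with identical confidence can receive very different intensities.

from math import exp

def poisson_cdf(k, lam):
    # P(N <= k) for N ~ Poisson(lam), by direct summation (a normal
    # approximation is usually preferred for very large counts).
    term, total = exp(-lam), exp(-lam)
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def intensity_of_implication(n, n_x, n_y, n_xy):
    # Rule X -> Y over n examples: n_x covered by X, n_y in class Y,
    # n_xy covered by X and belonging to Y.
    counterexamples = n_x - n_xy          # |X and not Y|
    lam = n_x * (n - n_y) / n             # expected counterexamples under independence
    # Probability that a random labelling would yield MORE counterexamples.
    return 1.0 - poisson_cdf(counterexamples, lam)

def confidence(n, n_x, n_y, n_xy):
    return n_xy / n_x

# Hypothetical imbalanced setting: the target class covers 5% of 10,000 cases.
n, n_y = 10_000, 500
rules = [("r1: frequent antecedent", 400, 60),   # (name, n_x, n_xy)
         ("r2: rare antecedent", 20, 3)]
for name, n_x, n_xy in rules:
    print(name,
          f"confidence={confidence(n, n_x, n_y, n_xy):.2f}",
          f"intensity={intensity_of_implication(n, n_x, n_y, n_xy):.3f}")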

Section snippets

Classification based on associations

Before elaborating on the changes made to CBA, we provide a comprehensive overview of the original algorithm. We first define association rules and then introduce class association rules (CARs).
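
The following minimal sketch (with illustrative feature values and counts, not taken from the paper's dataset) shows the idea behind CARs, i.e. rules whose consequent is restricted to a class label, together with the generic CBA-style strategy of sorting the rules and classifying a case with the first rule that covers it.

from typing import FrozenSet, List, Tuple

# A CAR is (antecedent items, class label, support, confidence).
CAR = Tuple[FrozenSet[str], str, float, float]

cars: List[CAR] = [
    (frozenset({"income=low", "debt=high"}), "bankrupt", 0.04, 0.90),
    (frozenset({"debt=high"}), "bankrupt", 0.06, 0.70),
    (frozenset({"income=high"}), "healthy", 0.40, 0.95),
]

# CBA-style ordering: higher confidence first, ties broken by higher support
# (the original algorithm adds a further tie-break on rule generation order).
cars.sort(key=lambda r: (r[3], r[2]), reverse=True)

def classify(case: FrozenSet[str], default: str = "healthy") -> str:
    # Predict with the first (highest-ranked) rule whose antecedent covers the case.
    for antecedent, label, _sup, _conf in cars:
        if antecedent <= case:
            return label
    return default

print(classify(frozenset({"income=low", "debt=high", "age=34"})))  # -> bankrupt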

Limits of conditional probability (confidence)

A closer examination of the algorithm identified a potential weakness in the way the rules are sorted. Since rules are inserted into the classifier in sorted confidence order, this ordering determines to a large extent the accuracy of the final classifier. Confidence is a good measure of the quality of (class) association rules, but it also suffers from certain weaknesses, on which this section elaborates.

The first weakness is that the conditional probability of a rule X → Y

Description of the data

The training data used in this study come from a satisfaction survey conducted among customers of a major bank in Belgium in 1996. Nationwide, 7264 customers of the bank filled out a questionnaire. The questionnaire includes questions probing the level of satisfaction with specific service aspects of the bank, questions on socio-demographic characteristics of the customers, and a question probing the overall level of satisfaction. Customers were asked to

Conclusion

The algorithm presented in this paper is a modified version of the CBA algorithm, which can be used to build classifiers based on association rules. CBA was adapted by coupling it with intensity of implication, a measure that quantifies the distance to random choice of small, even statistically non-significant, subsets. The evolution of the number of FP and FN was also traced separately. As a result, the implementation was more transparent, since the evolution of both types of errors could be

References (20)

  • A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition (1997)
  • R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proc. of...
  • R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. of the 20th International Conference on...
  • T. Brijs, G. Swinnen, K. Vanhoof, G. Wets, Comparing complete and partial classification for identifying latently...
  • J. Chen, H.Y. Liu, G.Q. Chen, Mining insightful classification rules directly and efficiently, in: IEEE/SMC’99, Vol. 3,...
  • E.R. DeLong et al., Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach, Biometrics (1988)
  • G. Dong, X. Zhang, L. Wong, J. Li, CAEP: Classification by aggregating emerging patterns, in: Proc. of the Second...
  • R. Gras et al., L’implication statistique: une nouvelle méthode d’analyse des données, Mathématiques, Informatique et Sciences Humaines (1993)
  • S. Guillaume et al., Improving the discovery of association rules with intensity of implication, in: Principles of Data Mining and Knowledge Discovery, Lecture Notes in Artificial Intelligence (1998)
  • J.A. Hanley et al., A method of comparing the areas under receiver operating characteristic curves derived from the same cases, Radiology (1983)

Cited by (13)

  • The best of two worlds: Balancing model strength and comprehensibility in business failure prediction using spline-rule ensembles

    2017, Expert Systems with Applications
    Citation Excerpt:

    By far, the majority of methodological contributions in the business failure prediction literature has focused upon methods originating from the data mining and machine learning literature. In this category one can cite artificial neural networks (Atiya, 2001; Pendharkar, 2005), decision trees (Frydman, Altman, & Kao, 1985), support vector machines (Li & Sun, 2011a), Bayesian networks (Sun & Shenoy, 2007), rough sets (McKee, 2003), k-nearest neighbors (Park & Han, 2002), association rules (Janssens, Wets, Brijs, & Vanhoof, 2005) and finally, ensemble learners (Li & Sun, 2011b). A comprehensive review of statistical and data mining techniques used for business failure prediction can be found in Ravi Kumar and Ravi (2007).

  • Increasing the effectiveness of associative classification in terms of class imbalance by using a novel pruning algorithm

    2012, Expert Systems with Applications
    Citation Excerpt:

    According to the average, UCONF appears to perform the best at 0.9363, while CONF performs the worst at 0.9313. Therefore, in contrast with the findings of Janssens et al. (2005), this work adopts a more stringent method, which demonstrates that the prediction model constructed from the ranking index IOI performs better than CONF. Table 4 summarizes the AUC test results of DeLong et al. (1988) for the three ranking indices.

  • Adjusting and generalizing CBA algorithm to handling class imbalance

    2012, Expert Systems with Applications
    Citation Excerpt:

    However, this algorithm prunes rules by pessimistic error pruning (PEP) (Quinlan, 1992), and so many overlapping data cases that satisfy rules, raising the possibility of misevaluation in the scoring of a test set. Janssens et al. (2005) studied class imbalance classification and used a rule sorting index, “Intensity of implication”, to mitigate the problem that the positive class rules are eliminated from a classifier. Chen, Hsu, and Hsu (2010) proved that the sorting index slightly improves the rank of the positive class rules.

  • Principal component case-based reasoning ensemble for business failure prediction

    2011, Information and Management
    Citation Excerpt:

    Business failure prediction (BFP) techniques coded in these systems should obviously be accurate in order to avoid significant bank losses. Commonly used models today include multivariate discriminant analysis (MDA) [5], logistic regression (Logit) [6,9], neural network (NN) [18], case-based reasoning (CBR) [12], rough sets theory [1,14], Bayesian network [23], data envelopment analysis [2,17,19], association rules [8], and support vector machine (SVM) [4,7]. These models all predict business failure using a single predictive model.
