Analysis of Credal-C4.5 for classification in noisy domains
Introduction
A Decision Tree (DT) is a very useful tool for classification. Its structure is simple and easy to interpret. Moreover, building the classification model normally requires little time. The ID3 algorithm (Quinlan, 1986) and its extension C4.5 (Quinlan, 1993) are widely used for designing decision trees.
In recent years, several mathematical models based on imprecise probabilities have been developed for representing information (Walley, 1996, Wang, 2010, Weichselberger, 2000). By using the theory of imprecise probabilities presented in Walley (1996), known as the Imprecise Dirichlet Model (IDM), Abellán and Moral (2003) developed an algorithm for designing decision trees, called credal decision trees (CDTs). The split criterion of this algorithm is based on imprecise probabilities and uncertainty measures on credal sets, i.e. closed and convex sets of probability distributions. In particular, the CDT algorithm extends the information gain measure used by ID3; the resulting split criterion is called the Imprecise Info-Gain (IIG). In Mantas and Abellán (2014a), credal decision trees are built using an extension of the IIG criterion in which the probability values of both the class variable and the features are estimated via imprecise probabilities.
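To make these measures concrete, the following sketch (ours, not code from the paper) computes the maximum entropy H* over the IDM credal set and the resulting Imprecise Info-Gain. It assumes the standard IDM intervals [n_z/(N+s), (n_z+s)/(N+s)] and the known property that H* is attained by distributing the extra mass s among the least frequent classes so that the distribution becomes as uniform as possible; function names are ours.

```python
import numpy as np

def upper_entropy_idm(counts, s=1.0):
    """Maximum entropy H* over the IDM credal set of a frequency vector.

    H* is reached by sharing the extra mass s among the least frequent
    classes so that the distribution becomes as uniform as possible.
    """
    n = np.asarray(counts, dtype=float)
    total, mass = n.sum() + s, s
    while mass > 1e-12:
        mins = np.where(np.isclose(n, n.min()))[0]
        rest = n[n > n.min() + 1e-12]
        gap = rest.min() - n.min() if rest.size else np.inf
        add = min(mass / len(mins), gap)   # raise the current minima together
        n[mins] += add
        mass -= add * len(mins)
    p = n / total
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def imprecise_info_gain(class_counts, branch_counts, s=1.0):
    """IIG: H* at the node minus the weighted H* of its branches."""
    total = sum(sum(c) for c in branch_counts)
    children = sum(sum(c) / total * upper_entropy_idm(c, s)
                   for c in branch_counts)
    return upper_entropy_idm(class_counts, s) - children
```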
The CDT algorithm obtains good experimental results (Abellán, Masegosa, 2009b, Abellán, Moral, 2005). Moreover, its use in bagging ensembles (Abellán, Mantas, 2014, Abellán, Masegosa, 2009a, Abellán, Masegosa, 2012a) and the above-mentioned extension (Mantas & Abellán, 2014a) are especially suitable when noisy data are classified. A complete and recent review of machine learning methods for handling label noise can be found in Frenay and Verleysen (2014).
The theory of credal decision trees (Abellán & Moral, 2003) and the C4.5 algorithm are connected in Mantas and Abellán (2014b) through the definition of the Credal-C4.5 (C-C4.5) algorithm. This algorithm is similar to C4.5, but it uses a new split criterion, called the Imprecise Info-Gain Ratio (IIGR), instead of the classic Info-Gain Ratio (IGR) of C4.5. IIGR is analogous to IGR, replacing precise probabilities and entropy with imprecise probabilities (obtained by applying the IDM) and the maximum entropy measure, respectively. In this way, we obtain a new algorithm (C-C4.5) that uses a model of imprecise probabilities but, in fact, builds standard decision trees, similar to those produced by C4.5. This approach differs from other models, such as belief decision trees (Elouedi, Mellouli, & Smets, 2001), which use imprecise probabilities to build trees whose leaves and nodes are represented by belief functions. C-C4.5 could also be used with belief functions in its leaves for imprecise classification, as with the CDT model used in Abellán and Masegosa (2012b).
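As an illustration of how IIGR relates to IGR, the following sketch builds on `imprecise_info_gain` from the block above. We assume here that the denominator keeps C4.5's classic split information (the Shannon entropy of the branch proportions), with only the numerator made imprecise.

```python
def imprecise_info_gain_ratio(class_counts, branch_counts, s=1.0):
    """IIGR: Imprecise Info-Gain normalised by the split information
    (assumed here to keep C4.5's classic Shannon form)."""
    total = sum(sum(c) for c in branch_counts)
    props = np.array([sum(c) / total for c in branch_counts])
    props = props[props > 0]
    split_info = -(props * np.log2(props)).sum()
    return imprecise_info_gain(class_counts, branch_counts, s) / split_info
```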
In Mantas and Abellán (2014b), it was shown experimentally that C-C4.5 obtains better results than classic C4.5 when noisy data sets are classified. However, no detailed explanation was given for this experimental finding. For this reason, this paper analyzes why the new split criterion used by C-C4.5 (IIGR) is less sensitive to noise than the split criteria of classic decision trees.
Moreover, the performance of the C-C4.5 algorithm depends on a parameter s. The algorithms C4.5 and C-C4.5 are equivalent when s=0, so C-C4.5 can be interpreted as a parametric modification of the C4.5 algorithm. Both algorithms build the same type of decision tree, in which the internal nodes are questions about the features and each leaf contains a value of the class variable. To further analyze the behavior of C-C4.5 when it classifies noisy data sets, this paper also studies the relation between the value of the parameter s and the classification of noisy data sets with C-C4.5.
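A quick check of this equivalence with the sketches above: when s = 0 the IDM credal set collapses to a single distribution, so H* coincides with Shannon entropy and, consequently, IIGR coincides with IGR.

```python
counts = [6, 3, 1]                        # class frequencies at a node
print(upper_entropy_idm(counts, s=0.0))   # Shannon entropy of (0.6, 0.3, 0.1)
print(upper_entropy_idm(counts, s=1.0))   # larger: the credal set admits a
                                          # more uniform distribution
```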
In an experimental study, we compare C-C4.5 and classic C4.5 when they classify data sets with and without label noise. In this experimentation, a different optimal value of s is found for each noise level. It can be deduced that increasing s reduces the capacity of the procedure to model the distribution of a data set; in this way, the risk of overfitting the learning data decreases when larger values of s are used. With the aim of improving accuracy, we show that the correct value of s depends on the noise level of a data set. This noise level can be estimated with procedures from the literature, such as the one in Končar (1997). Taking into account the trade-off between the complexity of the model and the cleanliness of the data set is important for obtaining better results. This trade-off can be attained by selecting an appropriate value of s in each case.
Section 2 briefly describes the necessary background on decision trees, credal decision trees and the C-C4.5 algorithm. Section 3 analyzes the differences between C-C4.5 and classic C4.5 when noisy data sets are used in classification. Section 4 studies the parameter s and its relation with noise. Section 5 describes and comments on the experiments carried out with different values of the parameter s for the C-C4.5 procedure, on data sets with varying percentages of label noise. Finally, Section 6 is devoted to the conclusions.
Decision trees
Decision trees (DTs) are models based on a recursive partition method, the aim of which is to divide the data set using a single variable at each level. The process for inferring a decision tree is mainly determined by the following aspects (a skeletal sketch is given after the list):
- a) The criteria used to select the attribute to insert in a node and branch on (split criteria).
- b) The criteria to stop the tree from branching.
- c) The method for assigning a class label or a probability distribution at the leaf nodes.
- d) The post-pruning process used.
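A skeletal recursive-partitioning procedure covering aspects (a)-(c) might look as follows. This is our illustrative sketch, not the paper's implementation; post-pruning (d) is omitted, and `score` stands for any split criterion, such as IGR or the IIGR sketched above.

```python
from collections import Counter

def build_tree(rows, labels, attrs, score, min_size=2):
    """Generic recursive partitioning.
    (a) `score(labels, partition)` ranks the candidate splits,
    (b) pure or small nodes stop the branching,
    (c) each leaf stores the majority class,
    (d) post-pruning is not included in this sketch."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(rows) < min_size or not attrs:
        return {"leaf": majority}

    def partition(a):                      # group rows by the value of a
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append((row, y))
        return groups

    best = max(attrs, key=lambda a: score(labels, partition(a)))
    children = {}
    for value, group in partition(best).items():
        sub_rows = [r for r, _ in group]
        sub_labels = [y for _, y in group]
        children[value] = build_tree(sub_rows, sub_labels,
                                     [a for a in attrs if a != best],
                                     score, min_size)
    return {"attr": best, "children": children, "default": majority}
```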
C-C4.5 and classic C4.5 on data with noise
According to the previous sections, the main difference between C-C4.5 and classic C4.5 is the split criterion: C-C4.5 uses the IIGR measure while C4.5 uses IGR. These measures differ in turn because IGR relies on classic Shannon entropy (H) whereas IIGR uses the maximum entropy function on a credal set (H*). It can be shown that the function H* is less sensitive to noise than the function H. Hence, C-C4.5 can outperform C4.5 in classification tasks on noisy data sets.
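An illustrative numeric check of this claim (not the paper's formal analysis), reusing `upper_entropy_idm` from the earlier sketch: flipping two of twenty labels changes H* less than it changes H, because the IDM mass s smooths the estimated distribution.

```python
def shannon(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

clean, noisy = [18, 2], [16, 4]       # two labels flipped out of twenty
print(abs(shannon(noisy) - shannon(clean)))                            # ~0.25
print(abs(upper_entropy_idm(noisy, 1) - upper_entropy_idm(clean, 1)))  # ~0.20
```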
The parameter s and its relation with the noise
The IDM can be interpreted as considering that the data distribution is not precisely known. According to Eq. (2), there are s items of data for a variable Z that are not observed, and the values of these items are unknown; hence these s items are allowed to take any value. In this way, we employ a convex set of probability distributions (the credal set of Eq. (3)), instead of a single probability distribution, to estimate the values of the variable Z.
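Eqs. (2) and (3) are not reproduced in these snippets; for reference, the standard IDM yields, for each value z of Z, the probability interval [n_z/(N+s), (n_z+s)/(N+s)], which the sketch below computes (the function name is ours).

```python
import numpy as np

def idm_intervals(counts, s=1.0):
    """IDM probability intervals [n_z/(N+s), (n_z+s)/(N+s)] for each value
    z of Z; the credal set is the set of all distributions consistent
    with these intervals."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    return [(nz / (N + s), (nz + s) / (N + s)) for nz in counts]

print(idm_intervals([6, 3, 1], s=1.0))
# approx. [(0.545, 0.636), (0.273, 0.364), (0.091, 0.182)]
```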
If our data set is noisy, we have two
Experimental analysis
Our aim is to study the performance of C-C4.5, as opposed to classic C4.5, when classifying data sets with different levels of noise. Moreover, C-C4.5 will be executed with distinct values of the parameter s.
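As a hedged sketch of how label noise can be injected for such an experiment (the paper's exact noise scheme may differ), the helper below flips a given fraction of labels to a different class chosen at random:

```python
import numpy as np

def add_label_noise(y, rate, seed=None):
    """Flip a `rate` fraction of the labels to a different class chosen
    uniformly at random among the remaining classes."""
    rng = np.random.default_rng(seed)
    y = np.array(y)
    classes = np.unique(y)
    flip = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    for i in flip:
        y[i] = rng.choice(classes[classes != y[i]])
    return y
```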
In order to check the above procedures, we used a broad and diverse collection of 50 well-known data sets, obtained from the UCI repository of machine learning data sets, which can be directly downloaded from http://archive.ics.uci.edu/ml. We took data sets that differ with respect to the number of
Conclusion
The application of the C-C4.5 algorithm to noisy data sets has been analyzed. It has been shown that the split criterion of the C-C4.5 algorithm is more robust to noise than that of the C4.5 algorithm.
On the other hand, the performance of the C-C4.5 algorithm depends on the parameter s. In an experimental study, the relation between the value of this parameter and the behavior of C-C4.5 when classifying noisy data sets has been studied. Several experiments have been presented by using
Acknowledgment
This work has been supported by the Spanish “Ministerio de Economía y Competitividad” under Project TEC2015-69496-R.
References

- Abellán, J., & Mantas, C. J. (2014). Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications.
- Abellán, J., & Masegosa, A. R. (2012). Bagging schemes on the presence of class noise in classification. Expert Systems with Applications.
- Abellán, J., & Moral, S. (2005). Upper entropy of credal sets. Applications to credal classification. International Journal of Approximate Reasoning.
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
- Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association.
- Nemenyi, P. (1963). Distribution-free multiple comparisons.
- Quinlan, J. R. (1986). Induction of decision trees. Machine Learning.
- Zaffalon, M. (2002). The naive credal classifier. Journal of Statistical Planning and Inference.
- Abellán, J. (2006). Uncertainty measures on probability intervals from the imprecise Dirichlet model. International Journal of General Systems.
- Abellán, J., Klir, G. J., & Moral, S. (2006). Disaggregated total uncertainty measure for credal sets. International Journal of General Systems.
- Abellán, J., & Masegosa, A. R. Requirements for total uncertainty measures in Dempster-Shafer theory of evidence. International Journal of General Systems.
- Abellán, J., & Masegosa, A. R. (2009). An experimental study about simple decision trees for bagging ensemble on datasets with classification noise.
- Abellán, J., & Masegosa, A. R. (2009). A filter-wrapper method to select variables for the naive Bayes classifier based on credal decision trees. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.
- Abellán, J., & Masegosa, A. R. (2012). Imprecise classification with credal decision trees. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.
- Abellán, J., & Moral, S. (2003). Building classification trees using the total uncertainty criterion. International Journal of Intelligent Systems.
- Abellán, J., & Moral, S. An algorithm to compute the upper entropy for order-2 capacities. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.