Elsevier

Information Sciences

Volume 180, Issue 8, 15 April 2010, Pages 1268-1291
On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets

https://doi.org/10.1016/j.ins.2009.12.014

Abstract

When performing a classification task, we may find data-sets whose classes are unevenly represented among their patterns. This problem is known as classification with imbalanced data-sets, and it appears in many real application areas. For this reason, it has recently become a relevant topic in the area of Machine Learning.

The aim of this work is to improve the behaviour of fuzzy rule based classification systems (FRBCSs) in the framework of imbalanced data-sets by means of a tuning step. Specifically, we adapt the 2-tuples based genetic tuning approach to classification problems, showing the good synergy between this method and some FRBCSs.

Our empirical results show that the 2-tuples based genetic tuning increases the performance of FRBCSs in all types of imbalanced data. Furthermore, when the initial Rule Base, built by a fuzzy rule learning methodology, obtains a good behaviour in terms of accuracy, we achieve a higher improvement in performance for the whole model when applying the genetic 2-tuples post-processing step. This enhancement is also obtained in the case of cooperation with a preprocessing stage, proving the necessity of rebalancing the training set before the learning phase when dealing with imbalanced data.

Introduction

There are many tools in the context of Machine Learning for solving a classification problem. One of them, known as fuzzy rule based classification systems (FRBCSs) [43], has the advantage of being easily interpretable by the end user or the expert. The disadvantage of these systems is their lack of accuracy when dealing with some complex systems, e.g. high-dimensional problems, problems with overlapping classes, or problems affected by noise. This is due to the inflexibility of the concept of linguistic variables, which imposes hard restrictions on the fuzzy rule structure [9].

In the specialized literature we can find different proposals to increase the accuracy of linguistic fuzzy systems, both applied to modeling and classification problems [1], [12], [21]. These approaches try to induce better cooperation among the rules by acting on one or two different model components: the fuzzy partition parameters stored in the Data Base (DB) and the Rule Base (RB).

To ease the genetic optimization of the DB membership functions (MFs), a new linguistic rule representation model was proposed in [2]. It is based on the linguistic 2-tuples representation [40], which allows the lateral displacement of a label by means of a single parameter. This way of working reduces the search space, which eases the derivation of optimal models. In [2], [3] the authors showed the high potential of this approach in regression problems, and our intention is to apply this genetic tuning to classification problems with imbalanced data-sets.
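As an illustration of the lateral displacement described above, the following minimal sketch shifts a triangular MF by a single parameter. It rests on assumptions that this section does not state: a uniform partition of the variable domain, a displacement parameter `alpha` restricted to [-0.5, 0.5), and a shift equal to `alpha` times the distance between adjacent label centres. The names `TriangularMF`, `uniform_partition` and `displace` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriangularMF:
    """Triangular membership function with vertices a <= b <= c."""
    a: float
    b: float
    c: float

    def membership(self, x: float) -> float:
        """Return the membership degree of x in [0, 1]."""
        if x == self.b:
            return 1.0
        if self.a < x < self.b:
            return (x - self.a) / (self.b - self.a)
        if self.b < x < self.c:
            return (self.c - x) / (self.c - self.b)
        return 0.0


def uniform_partition(lo: float, hi: float, n_labels: int) -> list:
    """Build n_labels evenly spaced triangular labels over [lo, hi]."""
    step = (hi - lo) / (n_labels - 1)
    return [
        TriangularMF(lo + (i - 1) * step, lo + i * step, lo + (i + 1) * step)
        for i in range(n_labels)
    ]


def displace(mf: TriangularMF, alpha: float, step: float) -> TriangularMF:
    """Shift the whole label sideways; alpha is the only tuned parameter.

    Assumption: alpha lies in [-0.5, 0.5) and is scaled by the distance
    between adjacent label centres (step).
    """
    if not -0.5 <= alpha < 0.5:
        raise ValueError("alpha must lie in [-0.5, 0.5)")
    shift = alpha * step
    return TriangularMF(mf.a + shift, mf.b + shift, mf.c + shift)


labels = uniform_partition(0.0, 1.0, 3)       # centres at 0.0, 0.5, 1.0
tuned = displace(labels[1], alpha=0.25, step=0.5)
print(tuned)                                  # middle label moved to the right
```

Under these assumptions each label contributes one gene to the chromosome instead of the three vertices of a classical MF tuning, which is the reduction in search space that the paragraph refers to.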

The problem of imbalanced data-sets [14] occurs when one class, usually the one that contains the concept to be learnt (the positive class), is underrepresented in the data-set. Addressing the class imbalance problem is a current challenge for the Data Mining community [72], and we must emphasize the significance of this situation, since such data appear in most real classification domains, e.g. risk management [42], medical diagnosis [54] and face recognition [52], among others.

Most learning algorithms obtain a high predictive accuracy over the majority class, but predict poorly over the minority class [67]. Furthermore, the examples of the minority class can be treated as noise and they might be completely ignored by the classifier. There are studies that show that most classification methods lose their classification ability when dealing with imbalanced data [47], [57].

The aim of this study is to improve the results obtained by FRBCSs in imbalanced data-sets by means of the application of the 2-tuples based genetic tuning. We want to enhance the performance of our fuzzy model to make it competitive with C4.5 [59], a decision tree algorithm that presents a good behaviour in imbalanced data-sets [55], [61], [62], and with Ripper [17], a traditional and accurate rule based classifier algorithm. We will also show that we can obtain a fuzzy classification model with a lower complexity than the standard interval rule learning algorithms, together with an intrinsic higher interpretability because of the use of fuzzy labels, as we have stated at the beginning of this section.

In this paper we use two learning methods to generate the RB for the FRBCS. The first one is the method proposed in [16], which we refer to as Chi et al.’s rule generation method. The second approach, defined by Ishibuchi and Yamamoto in [45], is a Fuzzy Hybrid Genetic Based Machine Learning (FH-GBML) algorithm.

In our first study on the topic [33], we analysed the behaviour of FRBCSs looking for the best configuration of the fuzzy components and the synergy with preprocessing techniques to deal with the problem of imbalanced data-sets. According to the decisions taken in that work, in this paper we will use triangular membership functions for the fuzzy partitions and rule weights in the consequent of the rules. We will study the use of the 2-tuples tuning directly over the original data-sets using the appropriate measure of performance to guide the search, but we will also apply a re-sampling procedure as a solution at the data level to deal with the imbalance problem, specifically using the “Synthetic Minority Over-sampling Technique” (SMOTE) [13] to prepare the training data for the learning process.
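The re-sampling idea behind SMOTE can be sketched as follows: a synthetic positive example is placed at a random point on the segment joining a minority example and one of its nearest minority neighbours. This is a simplified illustration and not the configuration used in the paper: the neighbourhood size `k`, the number of synthetic examples, the random choice of the base example and the function name `smote_sketch` are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours.

    minority: list of equal-length numeric tuples (the positive class).
    k and seed are illustrative defaults, not values taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a base example and rank the remaining ones by distance to it.
        idx = rng.randrange(len(minority))
        base = minority[idx]
        others = [p for j, p in enumerate(minority) if j != idx]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours and interpolate towards it.
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(positives, n_synthetic=6, k=2)
print(len(new_points))  # 6 synthetic positive examples
```

Because every synthetic point is a convex combination of two existing positive examples, it always stays inside the region already covered by the minority class.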

The rest of this paper is organized as follows: In Section 2, we present the imbalanced data-set problem, describing the preprocessing technique used in our work, the SMOTE algorithm, and discussing the evaluation metrics. In Section 3, we describe the fuzzy rule learning methodologies used in this study. Next, Section 4 shows the significance of the tuning of the fuzzy systems and introduces the 2-tuples tuning approach and the evolutionary algorithm that tunes the FRBCS. In Section 5, we include our experimental analysis in imbalanced data-sets with different degrees of imbalance, where we compare the FRBCSs with 2-tuples based genetic tuning with Ripper and C4.5, in order to validate our results. In Section 6, some concluding remarks and suggestions for further work are made. Finally, we include an appendix with the detailed results for the experiments performed in the experimental study.

Section snippets

Imbalanced data-sets in classification

In this section, we will first introduce the problem of imbalanced data-sets. Then, we will describe the preprocessing technique we have applied in order to deal with the imbalanced data-sets: the SMOTE algorithm. Finally, we will present the evaluation metrics for this type of classification problem.

Fuzzy rule based classification system learning methods

Any classification problem consists of $m$ training patterns $x_p = (x_{p1}, \ldots, x_{pn})$, $p = 1, 2, \ldots, m$, from $M$ classes, where $x_{pi}$ is the $i$th attribute value ($i = 1, 2, \ldots, n$) of the $p$th training pattern.

In this work we use fuzzy rules of the following form for our FRBCSs:

$$\text{Rule } R_j:\ \text{If } x_1 \text{ is } A_{j1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{jn} \text{ then Class} = C_j \text{ with } RW_j$$

where $R_j$ is the label of the $j$th rule, $x = (x_1, \ldots, x_n)$ is an $n$-dimensional pattern vector, $A_{ji}$ is an antecedent fuzzy set, $C_j$ is a class label, and $RW_j$ is the rule weight [44], [74]. We use triangular MFs as

Genetic tuning of the fuzzy rule based classification systems

The main objective of this work is to improve the performance of FRBCSs in the framework of imbalanced data-sets by means of a tuning approach based on 2-tuples, stressing the positive synergy between this genetic tuning and the FRBCSs in this specific scenario. This methodology consists of refining a previous definition of the DB once the RB has been obtained [4], [46], [48]. The tuning introduces a variation in the shape of the MFs that improves their global interaction with the main aim of

Experimental study

In this paper, we use the IR to distinguish between two classes of imbalanced data-sets: data-sets with a low imbalance, when the instances of the positive class are between 10% and 40% of the total instances (IR between 1.5 and 9), and data-sets with a high imbalance, where there are no more than 10% of positive instances in the whole data-set compared to the negative ones (IR higher than 9).
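The two categories above can be reproduced with a short sketch. It assumes that IR denotes the imbalance ratio, computed as the number of negative instances divided by the number of positive instances; this reading is consistent with the percentages quoted (40% positives gives 60/40 = 1.5, and 10% positives gives 90/10 = 9). The function names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels, positive):
    """Return (#negative instances) / (#positive instances).

    Assumption: this is the definition of IR behind the thresholds in the text.
    """
    counts = Counter(labels)
    n_pos = counts[positive]
    if n_pos == 0:
        raise ValueError("no positive instances in the data-set")
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


def imbalance_category(ir):
    """Map an IR value to the two groups used in the experimental study."""
    if ir > 9:
        return "high imbalance"
    if ir >= 1.5:
        return "low imbalance"
    return "outside the range considered"


labels = ["pos"] * 5 + ["neg"] * 95
ir = imbalance_ratio(labels, positive="pos")
print(ir, imbalance_category(ir))
```

For the 5%/95% split in the usage example, the ratio is 19, so the data-set falls in the high-imbalance group.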

We have considered 44 data-sets from the UCI repository [7] with different IR. Table 2 summarizes the

Concluding remarks and further work

In this work, we have adapted the 2-tuples based genetic tuning to classification problems with imbalanced data-sets in order to increase the performance of simple FRBCSs.

We have concluded that the tuning step is a necessity, since it always helps FRBCSs to obtain better results. Our empirical and statistical results have shown that the genetic tuning improves the behaviour of the FRBCS in imbalanced data-sets, both globally and for the different types considered, that is, data-sets with a low

Acknowledgment

This work has been supported by the Spanish Ministry of Science and Technology under Projects TIN2008-06681-C06-01, TIN2008-06681-C06-02, and the Andalusian Research Plan TIC-3928.

References (74)

  • A. Fernández et al.

    On the influence of an adaptive inference system in fuzzy rule based classification systems for imbalanced data-sets

    Expert Systems with Applications

    (2009)
  • A. Fernández et al.

    A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets

    Fuzzy Sets and Systems

    (2008)
  • F. Herrera et al.

    Tuning fuzzy logic controllers by genetic algorithms

    International Journal of Approximate Reasoning

    (1995)
  • Y.M. Huang et al.

    Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem

    Nonlinear Analysis: Real World Applications

    (2006)
  • K. Kilic et al.

    Comparison of different strategies of utilizing fuzzy clustering in structure identification

    Information Sciences

    (2007)
  • M. Li et al.

    A hybrid coevolutionary algorithm for designing fuzzy classifiers

    Information Sciences

    (2009)
  • M. Mazurowski et al.

    Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance

    Neural Networks

    (2008)
  • X. Peng et al.

    Robust BMPM training based on second-order cone programming and its application in medical diagnosis

    Neural Networks

    (2008)
  • C.-T. Su et al.

    Knowledge acquisition through information granulation for imbalanced data

    Expert Systems with Applications

    (2006)
  • Y. Sun et al.

    Cost-sensitive boosting for classification of imbalanced data

    Pattern Recognition

    (2007)
  • S. Suresh et al.

    Risk-sensitive loss functions for sparse multi-category classification problems

    Information Sciences

    (2008)
  • M.J. Zolghadri et al.

    Weighting fuzzy classification rules using receiver operating characteristics (ROC) analysis

    Information Sciences

    (2007)
  • R. Alcalá et al.

    Hybrid learning models to get the interpretability-accuracy trade-off in fuzzy modeling

    Soft Computing

    (2006)
  • R. Alcalá et al.

    A proposal for the genetic lateral tuning of linguistic fuzzy systems and its interaction with rule selection

    IEEE Transactions on Fuzzy Systems

    (2007)
  • R. Alcalá et al.

    Fuzzy control of HVAC systems optimized by genetic algorithms

    Applied Intelligence

    (2003)
  • J. Alcalá-Fdez et al.

    Increasing fuzzy rules cooperation based on evolutionary adaptive inference systems

    International Journal of Intelligent Systems

    (2007)
  • J. Alcalá-Fdez et al.

    KEEL: a software tool to assess evolutionary algorithms to data mining problems

    Soft Computing

    (2009)
  • A. Asuncion, D. Newman, UCI machine learning repository, University of California, Irvine, School of Information and...
  • A. Bastian

    How to handle the flexibility of linguistic variables with applications

    International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems

    (1994)
  • G.E.A.P.A. Batista et al.

    A study of the behaviour of several methods for balancing machine learning training data

    SIGKDD Explorations

    (2004)
  • N.V. Chawla et al.

    SMOTE: synthetic minority over-sampling technique

    Journal of Artificial Intelligence Research

    (2002)
  • N.V. Chawla et al.

    Editorial: special issue on learning from imbalanced data sets

    SIGKDD Explorations

    (2004)
  • Z. Chi et al.

    Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition

    (1996)
  • O. Cordón et al.

    Genetic Fuzzy Systems. Evolutionary Tuning and Learning of Fuzzy Knowledge Bases

    (2001)
  • K.A. Crockett et al.

    Genetic tuning of fuzzy inference within fuzzy classifier systems

    Expert Systems

    (2006)
  • J. Demšar

    Statistical comparisons of classifiers over multiple data sets

    Journal of Machine Learning Research

    (2006)