On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets

doi:10.1016/j.ins.2009.12.014

Information Sciences

Volume 180, Issue 8, 15 April 2010, Pages 1268-1291

Abstract

When performing a classification task, we may find data-sets whose classes are unevenly represented among their patterns. This problem is known as classification with imbalanced data-sets and it appears in many real application areas. For this reason, it has recently become a relevant topic in the area of Machine Learning.

The aim of this work is to improve the behaviour of fuzzy rule based classification systems (FRBCSs) in the framework of imbalanced data-sets by means of a tuning step. Specifically, we adapt the 2-tuples based genetic tuning approach to classification problems, showing the good synergy between this method and some FRBCSs.

Our empirical results show that the 2-tuples based genetic tuning increases the performance of FRBCSs in all types of imbalanced data. Furthermore, when the initial Rule Base, built by a fuzzy rule learning methodology, obtains a good behaviour in terms of accuracy, we achieve a higher improvement in performance for the whole model when applying the genetic 2-tuples post-processing step. This enhancement is also obtained in the case of cooperation with a preprocessing stage, proving the necessity of rebalancing the training set before the learning phase when dealing with imbalanced data.

Introduction

There are many tools in the context of Machine Learning for solving a classification problem. One of them, known as fuzzy rule based classification systems (FRBCSs) [43], has the advantage of being easily interpretable by the end user or the expert. The disadvantage of these systems is their lack of accuracy when dealing with some complex systems, e.g. high-dimensional problems, overlapping classes or noisy data. This is due to the inflexibility of the concept of linguistic variables, which imposes hard restrictions on the fuzzy rule structure [9].

In the specialized literature we can find different proposals to increase the accuracy of linguistic fuzzy systems, both applied to modeling and classification problems [1], [12], [21]. These approaches try to induce better cooperation among the rules by acting on one or two different model components: the fuzzy partition parameters stored in the Data Base (DB) and the Rule Base (RB).

To ease the genetic optimization of the DB membership functions (MFs), a new linguistic rule representation model was proposed in [2]. It is based on the linguistic 2-tuples representation [40] that allows the lateral displacement of a label considering a unique parameter. This way of working involves a reduction in the search space that eases the derivation of optimal models. In [2], [3] the authors determined the high potential of this approach in regression problems, and our intention is to apply this genetic tuning to classification with imbalanced problems.
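The lateral displacement described above can be illustrated with a minimal Python sketch. It assumes uniformly spaced triangular labels and a displacement parameter alpha in [-0.5, 0.5) expressed as a fraction of the distance between adjacent labels; the function names and the example values are hypothetical and are not taken from [2] or [40]. The point of the sketch is that each label is tuned through one parameter instead of the three vertices of its triangle, which is where the reduction in the search space comes from.

```python
def uniform_partition(lower, upper, n_labels):
    """Triangular labels (a, b, c) evenly spread over [lower, upper]."""
    step = (upper - lower) / (n_labels - 1)
    return [
        (lower + (i - 1) * step, lower + i * step, lower + (i + 1) * step)
        for i in range(n_labels)
    ]


def displace(label, alpha, step):
    """Shift a triangular label laterally by alpha * step, keeping its shape.

    alpha in [-0.5, 0.5) is an assumption of this sketch.
    """
    if not -0.5 <= alpha < 0.5:
        raise ValueError("alpha is expected in [-0.5, 0.5)")
    shift = alpha * step
    a, b, c = label
    return (a + shift, b + shift, c + shift)


# Five labels on [0, 1]: adjacent peaks are 0.25 apart.
labels = uniform_partition(0.0, 1.0, 5)
# Move the middle label slightly to the left with a single parameter.
tuned = displace(labels[2], alpha=-0.2, step=0.25)
```

A genetic algorithm would then search over one alpha value per label, evaluating each candidate set of displaced labels with the chosen performance measure.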

The problem of imbalanced data-sets [14] occurs when one class, usually the one that contains the concept to be learnt (the positive class), is underrepresented in the data-set. Addressing the class imbalance problem is a current challenge of the Data Mining community [72], and we must emphasize the significance of this situation, since such types of data appear in most real classification domains, e.g. risk management [42], medical diagnosis [54] and face recognition [52], among others.

Most learning algorithms obtain a high predictive accuracy over the majority class, but predict poorly over the minority class [67]. Furthermore, the examples of the minority class can be treated as noise and they might be completely ignored by the classifier. There are studies that show that most classification methods lose their classification ability when dealing with imbalanced data [47], [57].
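A small worked example makes this concrete. The Python sketch below uses a hypothetical data-set with 950 negative and 50 positive examples and a classifier that always predicts the negative class; the counts and function names are invented for illustration. The geometric mean of the true rates is included as one commonly used imbalance-aware measure and is not necessarily the metric adopted in this work.

```python
import math


def rates(tp, fn, tn, fp):
    """Return (accuracy, true positive rate, true negative rate)."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn)  # accuracy on the positive (minority) class
    tnr = tn / (tn + fp)  # accuracy on the negative (majority) class
    return accuracy, tpr, tnr


# Hypothetical confusion matrix: every example is labelled negative.
accuracy, tpr, tnr = rates(tp=0, fn=50, tn=950, fp=0)

# Geometric mean of the per-class rates: zero when one class is ignored.
geometric_mean = math.sqrt(tpr * tnr)
```

Here the overall accuracy is 95% although no positive example is recognised, which is why measures that weigh both classes are preferred in this setting.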

The aim of this study is to improve the results obtained by FRBCSs in imbalanced data-sets by means of the application of the 2-tuples based genetic tuning. We want to enhance the performance of our fuzzy model to make it competitive with C4.5 [59], a decision tree algorithm that presents a good behaviour in imbalanced data-sets [55], [61], [62], and with Ripper [17], a traditional and accurate rule based classifier algorithm. We will also show that we can obtain a fuzzy classification model with a lower complexity than the standard interval rule learning algorithms, together with an intrinsic higher interpretability because of the use of fuzzy labels, as we have stated at the beginning of this section.

In this paper we use two learning methods to generate the RB for the FRBCS. The first one is the method proposed in [16], which we refer to as Chi et al.'s rule generation method. The second approach, defined by Ishibuchi and Yamamoto in [45], is a Fuzzy Hybrid Genetic Based Machine Learning (FH-GBML) algorithm.

In our first study on the topic [33], we analysed the behaviour of FRBCSs looking for the best configuration of the fuzzy components and the synergy with preprocessing techniques to deal with the problem of imbalanced data-sets. According to the decisions taken in that work, in this paper we will use triangular membership functions for the fuzzy partitions and rule weights in the consequent of the rules. We will study the use of the 2-tuples tuning directly over the original data-sets using the appropriate measure of performance to guide the search, but we will also apply a re-sampling procedure as a solution at the data level to deal with the imbalance problem, specifically using the “Synthetic Minority Over-sampling Technique” (SMOTE) [13] to prepare the training data for the learning process.

The rest of this paper is organized as follows: In Section 2, we present the imbalanced data-set problem, describing the preprocessing technique used in our work, the SMOTE algorithm, and discussing the evaluation metrics. In Section 3, we describe the fuzzy rule learning methodologies used in this study. Next, Section 4 shows the significance of the tuning of the fuzzy systems and introduces the 2-tuples tuning approach and the evolutionary algorithm that tunes the FRBCS. In Section 5, we include our experimental analysis in imbalanced data-sets with different degrees of imbalance, where we compare the FRBCSs with 2-tuples based genetic tuning with Ripper and C4.5, in order to validate our results. In Section 6, some concluding remarks and suggestions for further work are made. Finally, we include an appendix with the detailed results for the experiments performed in the experimental study.

Imbalanced data-sets in classification

In this section, we will first introduce the problem of imbalanced data-sets. Then, we will describe the preprocessing technique we have applied in order to deal with the imbalanced data-sets: the SMOTE algorithm. Finally, we will present the evaluation metrics for this type of classification problem.
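The core idea of SMOTE, creating synthetic minority examples by interpolating between a minority example and one of its nearest minority neighbours, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: base examples are drawn at random instead of being processed one by one with a fixed over-sampling rate, attributes are numeric, and the function name, parameters and data are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating minority examples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base example (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Four hypothetical minority examples with two numeric attributes.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Every synthetic point lies on a segment joining two existing minority examples, so the minority region is densified without duplicating patterns.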

Fuzzy rule based classification system learning methods

Any classification problem consists of $m$ training patterns $x_{p} = (x_{p1}, \dots, x_{pn})$, $p = 1, 2, \dots, m$, from $M$ classes, where $x_{pi}$ is the $i$th attribute value ($i = 1, 2, \dots, n$) of the $p$th training pattern.

In this work we use fuzzy rules of the following form for our FRBCSs: $\text{Rule } R_{j}: \text{ If } x_{1} \text{ is } A_{j1} \text{ and } \dots \text{ and } x_{n} \text{ is } A_{jn} \text{ then Class} = C_{j} \text{ with } RW_{j}$, where $R_{j}$ is the label of the $j$th rule, $x = (x_{1}, \dots, x_{n})$ is an $n$-dimensional pattern vector, $A_{ji}$ is an antecedent fuzzy set, $C_{j}$ is a class label, and $RW_{j}$ is the rule weight [44], [74]. We use triangular MFs as
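As an illustration of this rule structure, the following Python sketch evaluates rules with triangular MFs and a rule weight. The product conjunction, the winning-rule decision, and all names and numeric values are assumptions made for the example; this excerpt does not state which operators the FRBCSs actually use.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with vertices a < b < c."""
    if x == b:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0


def classify(x, rules):
    """Return the class of the rule with the largest matching degree * RW_j.

    Each rule is (antecedents, class_label, rule_weight), where antecedents
    holds one (a, b, c) triangle per attribute. Returns None if no rule fires.
    """
    best_class, best_score = None, 0.0
    for antecedents, class_label, rule_weight in rules:
        matching = 1.0
        for value, (a, b, c) in zip(x, antecedents):
            matching *= triangular(value, a, b, c)  # product T-norm (assumed)
        score = matching * rule_weight
        if score > best_score:
            best_class, best_score = class_label, score
    return best_class


# Two hypothetical labels on the domain [0, 1] and two hypothetical rules.
LOW, HIGH = (-1.0, 0.0, 1.0), (0.0, 1.0, 2.0)
rules = [
    ((LOW, LOW), "negative", 0.9),
    ((HIGH, HIGH), "positive", 0.8),
]
predicted = classify((0.9, 0.8), rules)
```

For the pattern (0.9, 0.8) the second rule has matching degree 0.9 × 0.8 = 0.72, so it wins even though its rule weight is lower.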

Genetic tuning of the fuzzy rule based classification systems

The main objective of this work is to improve the performance of FRBCSs in the framework of imbalanced data-sets by means of a tuning approach based on 2-tuples, stressing the positive synergy between this genetic tuning and the FRBCSs in this specific scenario. This methodology consists of refining a previous definition of the DB once the RB has been obtained [4], [46], [48]. The tuning introduces a variation in the shape of the MFs that improves their global interaction with the main aim of

Experimental study

In this paper, we use the imbalance ratio (IR) to distinguish between two classes of imbalanced data-sets: data-sets with a low imbalance, where the instances of the positive class are between 10% and 40% of the total instances (IR between 1.5 and 9), and data-sets with a high imbalance, where no more than 10% of the instances in the whole data-set are positive (IR higher than 9).
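The thresholds above follow directly from the class proportions. The short Python sketch below assumes that the IR is the number of negative instances divided by the number of positive instances, which is consistent with the percentages quoted (40% positives gives 60/40 = 1.5 and 10% positives gives 90/10 = 9); the function names are hypothetical.

```python
def imbalance_ratio(n_negative, n_positive):
    """IR assumed to be the ratio of negative to positive instances."""
    return n_negative / n_positive


def imbalance_level(ir):
    """'high' when IR is above 9, 'low' otherwise, following the text."""
    return "high" if ir > 9 else "low"


# 40% positives -> IR 1.5, 10% positives -> IR 9, 5% positives -> IR 19.
examples = [imbalance_ratio(60, 40), imbalance_ratio(90, 10), imbalance_ratio(95, 5)]
levels = [imbalance_level(ir) for ir in examples]
```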

We have considered 44 data-sets from the UCI repository [7] with different IR. Table 2 summarizes the

Concluding remarks and further work

In this work, we have adapted the 2-tuples based genetic tuning to classification problems with imbalanced data-sets in order to increase the performance of simple FRBCSs.

We have concluded that the tuning step is a necessity, since it always helps FRBCSs to obtain better results. Our empirical and statistical results have shown that the genetic tuning improves the behaviour of the FRBCS in imbalanced data-sets, both globally and for the different types considered, that is, data-sets with a low

Acknowledgment

This work has been supported by the Spanish Ministry of Science and Technology under Projects TIN2008-06681-C06-01 and TIN2008-06681-C06-02, and by the Andalusian Research Plan under Project TIC-3928.

References (74)

R. Alcalá et al. Genetic learning of accurate and compact fuzzy rule based systems based on the 2-tuples linguistic representation. International Journal of Approximate Reasoning (2007)

R. Barandela et al. Strategies for learning in class imbalance problems. Pattern Recognition (2003)

A.P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition (1997)

M.-C. Chen et al. An information granulation based data mining approach for classifying imbalanced data. Information Sciences (2008)

W.W. Cohen. Fast effective rule induction

O. Cordón et al. Ten years of genetic fuzzy systems: current framework and new trends. Fuzzy Sets and Systems (2004)

O. Cordón et al. A three-stage evolutionary process for learning descriptive and approximate fuzzy logic controller knowledge bases from examples. International Journal of Approximate Reasoning (1997)

O. Cordón et al. A genetic learning process for the scaling factors, granularity and contexts of the fuzzy rule-based system data base. Information Sciences (2001)

K.A. Crockett et al. On constructing a fuzzy inference framework using crisp decision trees. Fuzzy Sets and Systems (2006)

A. Fernández et al. Hierarchical fuzzy rule based classification systems with genetic rule selection for imbalanced data-sets. International Journal of Approximate Reasoning (2009)

J. Alcalá-Fdez et al. Increasing fuzzy rules cooperation based on evolutionary adaptive inference systems. International Journal of Intelligent Systems (2007)

J. Alcalá-Fdez et al. KEEL: a software tool to assess evolutionary algorithms to data mining problems. Soft Computing (2009)

A. Asuncion, D. Newman, UCI machine learning repository, University of California, Irvine, School of Information and...

A. Bastian. How to handle the flexibility of linguistic variables with applications. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (1994)

G.E.A.P.A. Batista et al. A study of the behaviour of several methods for balancing machine learning training data. SIGKDD Explorations (2004)

N.V. Chawla et al. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (2002)

N.V. Chawla et al. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations (2004)

Z. Chi et al. Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition (1996)

O. Cordón et al. Genetic Fuzzy Systems. Evolutionary Tuning and Learning of Fuzzy Knowledge Bases (2001)

K.A. Crockett et al. Genetic tuning of fuzzy inference within fuzzy classifier systems. Expert Systems (2006)

J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (2006)
