Knowledge acquisition through information granulation for imbalanced data

https://doi.org/10.1016/j.eswa.2005.09.082

Abstract

When learning from imbalanced/skewed data, in which almost all instances are labeled as one class while far fewer are labeled as the other, traditional machine learning algorithms tend to produce high accuracy over the majority class but poor predictive accuracy over the minority class. This paper proposes a novel method, the ‘knowledge acquisition via information granulation’ (KAIG) model, which not only removes unnecessary details and provides better insight into the essence of the data but also effectively solves ‘class imbalance’ problems. In this model, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio) are introduced to determine a suitable level of granularity. We also develop the concept of sub-attributes to describe granules and to tackle the overlap among granules. Seven data sets from the UCI data bank, including one imbalanced diagnosis data set (pima-Indians-diabetes), are used to evaluate the effectiveness of the KAIG model. Using several performance indexes—overall accuracy, G-mean and the Receiver Operating Characteristic (ROC) curve—the experimental results, compared against C4.5 and Support Vector Machine (SVM), demonstrate the superiority of our method.

Introduction

Learning from imbalanced/skewed data is an important topic that arises very often in practice. In such data, one class may be represented by a large number of examples while the other is represented by only a few. Many real-world data sets have these characteristics, arising in fraud detection, text classification (Chawla et al., 2002, Chawla et al., 2004), telecommunications management, oil spill detection, risk management, medical diagnosis/monitoring, financial analysis of loan policy or bankruptcy (Batista et al., 2004, Chawla et al., 2004, Grzymala-Busse et al., 2004) and protein data (Provost & Fawcett, 2001). Traditional classifiers, which seek accurate performance over the full range of instances, are not suitable for imbalanced learning tasks (Batista et al., 2004, Chawla et al., 2004, Guo and Viktor, 2004), since they tend to classify all data into the majority class, which is usually the less important class. Therefore, these traditional algorithms often produce high accuracy over the majority class but poor predictive accuracy over the minority class.

To cope with imbalanced data sets, several methods have been proposed in the literature, such as sampling (Batista et al., 2004, Chawla et al., 2002, Guo and Viktor, 2004), adjusting the cost matrices (Cristianini & Shawe-Taylor, 2000), and moving the decision thresholds (Chawla et al., 2002, Huang et al., 2004, Jo and Japkowicz, 2004). Sampling methods reduce data imbalance by ‘down-sampling’ (removing) instances from the majority class, by ‘up-sampling’ (duplicating) training instances from the minority class, or both. The second kind of method improves prediction accuracy by adjusting the cost (weight) for each class or changing the strength of rules (Batista et al., 2004). The third school of methods adapts the decision thresholds to impose a bias toward the minority class. However, these three schools of methods lack a rigorous and systematic treatment of imbalanced data (Huang et al., 2004). For example, down-sampling loses information, while up-sampling introduces noise.
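As a minimal sketch of the two sampling strategies described above (the class sizes and function names are illustrative, not taken from the paper), random down-sampling and up-sampling can be written as:

```python
import random

def downsample(majority, minority, seed=0):
    """Randomly remove majority-class instances until the classes balance."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

def upsample(majority, minority, seed=0):
    """Randomly duplicate minority-class instances until the classes balance."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

majority = [(x, 0) for x in range(95)]  # 95 majority-class instances
minority = [(x, 1) for x in range(5)]   # 5 minority-class instances

maj_d, min_d = downsample(majority, minority)
maj_u, min_u = upsample(majority, minority)
print(len(maj_d), len(min_d))  # 5 5
print(len(maj_u), len(min_u))  # 95 95
```

The sketch also makes the drawbacks visible: down-sampling discards 90 of the 95 majority instances (lost information), while up-sampling repeats minority instances, which can amplify noise.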

In this study, we introduce the concept of ‘information granulation’ to solve class imbalance problems. There are two reasons why we propose this concept for the task. The first is human instinct. As human beings, we have developed a granular view of the world: when describing a problem, we tend to shy away from numbers and use aggregates to ponder the question instead. This is especially true when a problem involves incomplete, uncertain, or vague information. It may sometimes be difficult to differentiate distinct elements, and so one is forced to consider ‘information granules’ (IGs), which are collections of entities arranged together due to their similarity, functional adjacency and indistinguishability (Bargiela and Pedrycz, 2003, Castellano and Fanelli, 2001, Yao and Yao, 2002, Zadeh, 1979). A typical example is the theory of rough sets (Walczak & Massart, 1999). The process of constructing IGs is referred to as information granulation. This was first pointed out in the pioneering work of Zadeh (1979), who coined the term ‘information granulation’ and emphasized that a plethora of details does not necessarily amount to knowledge. Granulation serves as an abstraction mechanism for reducing the overall conceptual burden. The essential factor driving the granulation of information is the need to comprehend the problem and gain better insight into its essence, rather than getting buried in unnecessary details. By changing the size of the IGs, we can hide or reveal more or less detail (Bargiela & Pedrycz, 2003).

The second reason concerns the behavior of data. In many practical data sets, such as medical/diagnosis, inspection, fault monitoring and fraud detection data, the normal group and the abnormal group are considered separate populations. Taguchi and Jugulum (2002) regarded every abnormal condition (a condition outside the ‘healthy’ group) as unique, since each such condition occurs in its own way. Tolstoy's line from Anna Karenina, ‘All happy families look alike. Every unhappy family is unhappy after its own fashion’, is quoted to illustrate this view (Taguchi & Jugulum, 2002). In other words, the normal group (e.g. healthy patients, good products) looks alike, while each member of the abnormal group (e.g. sick patients, defective products) is unique. If we construct IGs from the similarity of numerical data, the number of IGs in the normal group will be remarkably smaller than the number of normal numerical instances. Considering IGs instead of raw numerical data can therefore increase the proportion of abnormal data and alleviate the imbalanced/skewed situation.
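A toy illustration of this effect (this is not the paper's Fuzzy ART procedure; the one-dimensional data and the greedy merging rule are our own assumptions): a tightly clustered "normal" group collapses into very few granules, while a scattered "abnormal" group keeps one granule per instance, so the class ratio among granules is far less skewed than among raw instances.

```python
def granulate(points, radius):
    """Greedy one-pass granulation: each point joins the first granule
    whose running centroid lies within `radius`, else starts a new granule."""
    granules = []  # each granule stored as [sum_of_values, count]
    for x in points:
        for g in granules:
            if abs(g[0] / g[1] - x) <= radius:
                g[0] += x
                g[1] += 1
                break
        else:
            granules.append([x, 1])
    return [g[0] / g[1] for g in granules]  # granule centroids

# Normal group: tightly clustered ("look alike"); abnormal group: scattered ("unique").
normal = [10.0 + 0.01 * i for i in range(90)]
abnormal = [0.0, 25.0, 40.0, 60.0, 85.0]

norm_igs = granulate(normal, radius=1.0)
abn_igs = granulate(abnormal, radius=1.0)
print(len(norm_igs), len(abn_igs))  # 1 5
```

At the instance level the ratio is 5:90; at the granule level it becomes 5:1, which is the mechanism the paper exploits.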

In this study, we propose a ‘knowledge acquisition via information granulation’ (KAIG) model which improves classification performance by controlling the reduction of unnecessary details. In the KAIG model, a Fuzzy ART (adaptive resonance theory) neural network is used to construct IGs. Two indexes, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio), are developed to determine a suitable level of granularity. The concept of sub-attributes is presented to tackle the overlap among granules. Six data sets (one serving as an illustrative example) from the data bank are employed to illustrate our method and evaluate the effectiveness of the proposed model. In addition, one imbalanced diagnosis data set, pima-Indians-diabetes, is used to demonstrate the superiority of our method in solving the class imbalance problem, using overall accuracy, G-mean and the receiver operating characteristic (ROC) curve as performance indexes.
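The G-mean used throughout the evaluation is the geometric mean of sensitivity (minority-class recall) and specificity (majority-class recall). A minimal sketch (the confusion-matrix counts below are invented for illustration) shows why it exposes classifiers that overall accuracy flatters:

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# A degenerate classifier that labels everything as the majority class
# on a 95:5 data set scores 95% accuracy but a G-mean of zero.
tp, fn, tn, fp = 0, 5, 95, 0
accuracy = (tp + tn) / (tp + fn + tn + fp)
print(accuracy, g_mean(tp, fn, tn, fp))  # 0.95 0.0
```

Because either recall being zero drives the whole product to zero, the G-mean rewards only classifiers that perform on both classes at once.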

Section snippets

Granular computing

Granular computing, which is oriented towards the representation and processing of IGs, is quickly becoming an emerging conceptual and computing paradigm of information processing (Bargiela & Pedrycz, 2003). It is a superset of the theory of fuzzy information granulation, rough set theory and interval computations, and a subset of granular mathematics. Granular computing, as opposed to numeric computing, is knowledge-oriented, whereas numeric computing is data-oriented. The main issues (Castellano &

Proposed methodologies

This section describes in detail the procedure of the KAIG model. First, we address how the IGs are formed from numerical data. Secondly, H-index and U-ratio are introduced to determine the level of granularity which can be used to construct IGs in Fuzzy ART. Then, we try to describe IGs by using sub-attributes and extract knowledge from them. The well-known dataset, iris, will serve as an illustrative example.

Evaluation of KAIG model

To evaluate the effectiveness of the KAIG model, five data sets from the UCI Machine Learning Repository (http://www.ics.uci.edu/∼mlearn/) are considered in this section. Table 9 provides a brief description of each data set, including its size, number of features, data characteristics (binary/continuous), and defined classes. Before implementation, we divide each data set into a training set and a testing set in the proportion 3:1.
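A minimal sketch of the 3:1 train/test split (the per-class stratification and all names below are our assumptions; the paper states only the 3:1 proportion):

```python
import random

def stratified_split(data, labels, train_frac=0.75, seed=0):
    """Split (data, labels) into train/test at ~3:1 within each class,
    so the class proportions are preserved in both partitions."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        k = round(len(xs) * train_frac)
        train += [(x, y) for x in xs[:k]]
        test += [(x, y) for x in xs[k:]]
    return train, test

data = list(range(100))
labels = [0] * 80 + [1] * 20
train, test = stratified_split(data, labels)
print(len(train), len(test))  # 75 25
```

Stratifying per class matters for imbalanced data: a plain random 3:1 split could by chance leave almost no minority instances in the test set.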

With the help of the H-index and the U

Implementation in imbalanced data

This section applies the KAIG method to overcome class imbalance problems. C4.5 and SVM are commonly used as benchmarks or base learners in related works (Batista et al., 2004, Guo and Viktor, 2004, Huang et al., 2004, Jo and Japkowicz, 2004, Provost and Fawcett, 2001, Radivojac, Chawla, Dunker, & Obradovic, 2004). Therefore, the experimental results of KAIG are compared with these two methods. A brief introduction to SVM can be found in (Cristianini and Shawe-Taylor, 2000, Wu and

Conclusions

This study introduces the concept of information granulation to solve class imbalance problems. A novel method, the KAIG model, is presented. In this model, we propose two indexes to determine the level of granularity and the ‘sub-attributes’ concept to describe IGs. The experimental results show that the KAIG model can improve classification performance by reducing unnecessary details of information. We also demonstrate that the proposed method has an excellent ability to identify the

Acknowledgements

This work was supported in part by National Science Council of Taiwan (Grant No. NSC 94-2213-E007-059).

References (25)

  • A. Estabrooks et al.

    A multiple resampling method for learning from imbalanced data sets

    Computational Intelligence

    (2004)
  • J.W. Grzymala-Busse et al.

    A comparison of two approaches to data mining from imbalanced data

    Lecture Notes in Computer Science

    (2004)