Knowledge acquisition through information granulation for imbalanced data

https://doi.org/10.1016/j.eswa.2005.09.082

Abstract

When learning from imbalanced/skewed data, in which almost all instances are labeled as one class while far fewer are labeled as the other, traditional machine learning algorithms tend to produce high accuracy over the majority class but poor predictive accuracy over the minority class. This paper proposes a novel method, the ‘knowledge acquisition via information granulation’ (KAIG) model, which not only removes unnecessary details and provides better insight into the essence of the data but also effectively solves ‘class imbalance’ problems. In this model, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio) are introduced to determine a suitable level of granularity. We also develop the concept of sub-attributes to describe granules and to tackle the overlap among granules. Seven data sets from the UCI data bank, including one imbalanced diagnosis data set (pima-Indians-diabetes), are used to evaluate the effectiveness of the KAIG model. Using several performance indexes—overall accuracy, G-mean and the Receiver Operating Characteristic (ROC) curve—the experimental results, compared against C4.5 and Support Vector Machine (SVM), demonstrate the superiority of our method.

Introduction

Learning from imbalanced/skewed data is an important topic that arises very often in practice. In such data, one class may be represented by a large number of examples while the other is represented by only a few. Many real-world data sets have these characteristics, arising in fraud detection, text classification (Chawla et al., 2002, Chawla et al., 2004), telecommunications management, oil spill detection, risk management, medical diagnosis/monitoring, financial analysis of loan policy or bankruptcy (Batista et al., 2004, Chawla et al., 2004, Grzymala-Busse et al., 2004) and protein data (Provost & Fawcett, 2001). Traditional classifiers, which seek accurate performance over the full range of instances, are not suitable for imbalanced learning tasks (Batista et al., 2004, Chawla et al., 2004, Guo and Viktor, 2004), since they tend to classify all data into the majority class, which is usually the less important class. Therefore, these traditional algorithms often produce high accuracy over the majority class but poor predictive accuracy over the minority class.

To cope with imbalanced data sets, several methods have been proposed in the literature, such as sampling (Batista et al., 2004, Chawla et al., 2002, Guo and Viktor, 2004), adjusting the cost matrices (Cristianini & Shawe-Taylor, 2000), and moving the decision thresholds (Chawla et al., 2002, Huang et al., 2004, Jo and Japkowicz, 2004). Sampling methods reduce data imbalance by ‘down-sampling’ (removing) instances from the majority class, by ‘up-sampling’ (duplicating) training instances from the minority class, or both. The second kind of method improves prediction accuracy by adjusting the cost (weight) for each class or changing the strength of rules (Batista et al., 2004). The third school of methods adapts the decision thresholds to impose a bias toward the minority class. However, these three schools of methods lack a rigorous and systematic treatment of imbalanced data (Huang et al., 2004). For example, down-sampling loses information, while up-sampling introduces noise.
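As a minimal sketch of the two sampling strategies described above (the class sizes and function names are illustrative, not taken from the paper), random down-sampling and up-sampling can be written as:

```python
import random

def downsample(majority, minority, seed=0):
    """Randomly remove majority-class instances until the classes balance."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

def upsample(majority, minority, seed=0):
    """Randomly duplicate minority-class instances until the classes balance."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

majority = [(x, 0) for x in range(95)]  # 95 majority-class instances
minority = [(x, 1) for x in range(5)]   # 5 minority-class instances

maj_d, min_d = downsample(majority, minority)
maj_u, min_u = upsample(majority, minority)
print(len(maj_d), len(min_d))  # 5 5
print(len(maj_u), len(min_u))  # 95 95
```

The sketch also makes the drawbacks visible: down-sampling discards 90 of the 95 majority instances (lost information), while up-sampling repeats minority instances, which can amplify noise.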

In this study, we introduce the concept of ‘information granulation’ to solve class imbalance problems. There are two reasons why we propose this concept for the task. The first is human instinct. As human beings, we have developed a granular view of the world: when describing a problem, we tend to shy away from numbers and use aggregates to ponder the question instead. This is especially true when a problem involves incomplete, uncertain, or vague information. It may sometimes be difficult to differentiate distinct elements, and so one is forced to consider ‘information granules’ (IGs), which are collections of entities arranged together due to their similarity, functional adjacency and indistinguishability (Bargiela and Pedrycz, 2003, Castellano and Fanelli, 2001, Yao and Yao, 2002, Zadeh, 1979). A typical example is the theory of rough sets (Walczak & Massart, 1999). The process of constructing IGs is referred to as information granulation. This was first pointed out in the pioneering work of Zadeh (1979), who coined the term ‘information granulation’ and emphasized that a plethora of details does not necessarily amount to knowledge. Granulation serves as an abstraction mechanism for reducing the overall conceptual burden. The essential factor driving the granulation of information is the need to comprehend the problem and gain better insight into its essence, rather than getting buried in unnecessary details. By changing the size of the IGs, we can hide or reveal more or less detail (Bargiela & Pedrycz, 2003).

The second reason concerns the behavior of data. In many practical data sets, such as medical/diagnosis, inspection, fault monitoring and fraud detection data, the normal group and the abnormal group are considered separate populations. Taguchi and Jugulum (2002) regarded every abnormal condition (a condition outside the ‘healthy’ group) as unique, since each such condition occurs in its own way. Tolstoy's line from Anna Karenina, ‘All happy families look alike. Every unhappy family is unhappy after its own fashion’, is quoted to illustrate this view (Taguchi & Jugulum, 2002). In other words, the normal group (e.g. healthy patients, good products) looks alike, while each member of the abnormal group (e.g. sick patients, defective products) is unique. If we construct IGs from the similarity of numerical data, the number of IGs in the normal group will be remarkably smaller than the number of normal numerical instances. Considering IGs instead of raw numerical data can therefore increase the proportion of abnormal data and alleviate the imbalanced/skewed situation.
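A toy illustration of this effect (this is not the paper's Fuzzy ART procedure; the one-dimensional data and the greedy merging rule are our own assumptions): a tightly clustered "normal" group collapses into very few granules, while a scattered "abnormal" group keeps one granule per instance, so the class ratio among granules is far less skewed than among raw instances.

```python
def granulate(points, radius):
    """Greedy one-pass granulation: each point joins the first granule
    whose running centroid lies within `radius`, else starts a new granule."""
    granules = []  # each granule stored as [sum_of_values, count]
    for x in points:
        for g in granules:
            if abs(g[0] / g[1] - x) <= radius:
                g[0] += x
                g[1] += 1
                break
        else:
            granules.append([x, 1])
    return [g[0] / g[1] for g in granules]  # granule centroids

# Normal group: tightly clustered ("look alike"); abnormal group: scattered ("unique").
normal = [10.0 + 0.01 * i for i in range(90)]
abnormal = [0.0, 25.0, 40.0, 60.0, 85.0]

norm_igs = granulate(normal, radius=1.0)
abn_igs = granulate(abnormal, radius=1.0)
print(len(norm_igs), len(abn_igs))  # 1 5
```

At the instance level the ratio is 5:90; at the granule level it becomes 5:1, which is the mechanism the paper exploits.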

In this study, we propose a ‘knowledge acquisition via information granulation’ (KAIG) model which improves classification performance by controlling the reduction of unnecessary details. In the KAIG model, a Fuzzy ART (adaptive resonance theory) neural network is used to construct IGs. Two indexes, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio), are developed to determine a suitable level of granularity. The concept of sub-attributes is presented to tackle the overlap among granules. Six data sets (one serving as an illustrative example) from the data bank are employed to illustrate our method and evaluate the effectiveness of the proposed model. In addition, one imbalanced diagnosis data set, pima-Indians-diabetes, is used to demonstrate the superiority of our method in solving the class imbalance problem, using overall accuracy, G-mean and the receiver operating characteristic (ROC) curve as performance indexes.
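The G-mean used throughout the evaluation is the geometric mean of sensitivity (minority-class recall) and specificity (majority-class recall). A minimal sketch (the confusion-matrix counts below are invented for illustration) shows why it exposes classifiers that overall accuracy flatters:

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# A degenerate classifier that labels everything as the majority class
# on a 95:5 data set scores 95% accuracy but a G-mean of zero.
tp, fn, tn, fp = 0, 5, 95, 0
accuracy = (tp + tn) / (tp + fn + tn + fp)
print(accuracy, g_mean(tp, fn, tn, fp))  # 0.95 0.0
```

Because either recall being zero drives the whole product to zero, the G-mean rewards only classifiers that perform on both classes at once.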

Section snippets

Granular computing

Granular computing, which is oriented towards the representation and processing of IGs, is quickly becoming an emerging conceptual and computing paradigm of information processing (Bargiela & Pedrycz, 2003). It is a superset of the theory of fuzzy information granulation, rough set theory and interval computations, and a subset of granular mathematics. Granular computing, as opposed to numeric computing, is knowledge-oriented, whereas numeric computing is data-oriented. The main issues (Castellano &

Proposed methodologies

This section describes in detail the procedure of the KAIG model. First, we address how the IGs are formed from numerical data. Secondly, H-index and U-ratio are introduced to determine the level of granularity which can be used to construct IGs in Fuzzy ART. Then, we try to describe IGs by using sub-attributes and extract knowledge from them. The well-known dataset, iris, will serve as an illustrative example.

Evaluation of KAIG model

To evaluate the effectiveness of the KAIG model, five data sets from the UCI Machine Learning Repository (http://www.ics.uci.edu/∼mlearn/) are considered in this section. Table 9 provides a brief description of each data set, including its size, number of features, data characteristics (binary/continuous), and defined classes. Before implementation, we divide each data set into a training set and a testing set in the proportion 3:1.
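A minimal sketch of the 3:1 train/test split (the per-class stratification and all names below are our assumptions; the paper states only the 3:1 proportion):

```python
import random

def stratified_split(data, labels, train_frac=0.75, seed=0):
    """Split (data, labels) into train/test at ~3:1 within each class,
    so the class proportions are preserved in both partitions."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        k = round(len(xs) * train_frac)
        train += [(x, y) for x in xs[:k]]
        test += [(x, y) for x in xs[k:]]
    return train, test

data = list(range(100))
labels = [0] * 80 + [1] * 20
train, test = stratified_split(data, labels)
print(len(train), len(test))  # 75 25
```

Stratifying per class matters for imbalanced data: a plain random 3:1 split could by chance leave almost no minority instances in the test set.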

With the help of the H-index and the U

Implementation in imbalanced data

This section applies the KAIG method to overcome class imbalance problems. C4.5 and SVM are commonly used as benchmarks or base learners in related works (Batista et al., 2004, Guo and Viktor, 2004, Huang et al., 2004, Jo and Japkowicz, 2004, Provost and Fawcett, 2001, Radivojac, Chawla, Dunker, & Obradovic, 2004). Therefore, the experimental results of KAIG are compared with these two methods. A brief introduction to SVM can be found in (Cristianini and Shawe-Taylor, 2000, Wu and

Conclusions

This study introduces the concept of information granulation to solve class imbalance problems. A novel method, the KAIG model, is presented. In this model, we propose two indexes to determine the level of granularity and the ‘sub-attributes’ concept to describe IGs. The experimental results show that the KAIG model can improve classification performance by reducing unnecessary details of information. We also demonstrate that the proposed method has an excellent ability to identify the

Acknowledgements

This work was supported in part by National Science Council of Taiwan (Grant No. NSC 94-2213-E007-059).

References (25)

  • A. Estabrooks et al.

    A multiple resampling method for learning from imbalanced data sets

    Computational Intelligence

    (2004)
  • J.W. Grzymala-Busse et al.

    A comparison of two approaches to data mining from imbalanced data

    Lecture Notes in Computer Science

    (2004)