Novel classification method for sensitive problems and uneven datasets based on neural networks and fuzzy logic
Introduction
In real-world applications, particularly in the industrial and medical fields, binary classification problems are frequently encountered that are difficult to solve because one of the classes of interest is poorly represented in the database, owing either to the nature of the problem or to the characteristics of the available data. The rarity of certain patterns, especially when combined with their low separability from the rest of the instances, can make their identification very hard. Malfunction detection often belongs to this class of problems: malfunctions are normally very rare with respect to normal operating conditions, so the datasets collected when monitoring an industrial process mainly contain observations of the normal status of the process, and only a few data relate to abnormal situations. Such unbalanced datasets compromise classification ability because few instances of the rare events are available in the learning phase of the classifier. Apart from the quite rare cases in which the classes differ very clearly, it is not easy to define a boundary that separates the classes and, in particular, to point out the samples belonging to the less numerous class. Moreover, noise usually makes matters worse by further decreasing the generalization ability of the classifiers. On the other hand, in applications such as fault detection [1], [2] and disease diagnosis [3], [4], the crucial task is precisely the detection of abnormal situations, and thus the data belonging to the less numerous class are those conveying the interesting information. For the sake of simplicity, in the rest of this work the class of unfrequent patterns, whose correct detection is the main aim of the classification task, is identified by the abbreviation classunf, while the class of frequent patterns is indicated as classfre.
When coping with unbalanced datasets, traditional classifiers such as induction trees and multilayer perceptrons do not achieve good performance, as they are designed to optimize overall performance without taking into account the relative distribution of each class [5]. As a result, these classifiers tend to ignore classunf samples and accurately classify only the classfre ones. There are two families of approaches for dealing with this problem: external and internal. Internal approaches create new algorithms expressly designed for unbalanced datasets, while external approaches use traditional algorithms but resample the data used by such algorithms in order to reduce the effect of the imbalance.
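As an illustration of the external family, the following is a minimal sketch of random oversampling, one common resampling scheme. It is not taken from the cited works: the function name `random_oversample`, the label convention (1 for classunf, 0 for classfre) and the toy data are assumptions made only for this example.

```python
import numpy as np

def random_oversample(X, y, rare_label=1, seed=0):
    """Replicate rare-class rows (sampling with replacement) until both
    classes have the same number of rows. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    rare_idx = np.flatnonzero(y == rare_label)
    freq_idx = np.flatnonzero(y != rare_label)
    n_extra = len(freq_idx) - len(rare_idx)
    # Nothing to do if the rare class is absent or already large enough.
    if len(rare_idx) == 0 or n_extra <= 0:
        return X.copy(), y.copy()
    extra = rng.choice(rare_idx, size=n_extra, replace=True)
    keep = np.concatenate([freq_idx, rare_idx, extra])
    rng.shuffle(keep)  # avoid presenting all duplicates at the end
    return X[keep], y[keep]

# Toy data (assumed): 95 classfre rows and 5 classunf rows.
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([0] * 95 + [1] * 5)
X_bal, y_bal = random_oversample(X_demo, y_demo)
print(X_bal.shape, int((y_bal == 1).sum()), int((y_bal == 0).sum()))
```

A traditional classifier would then be trained on `X_bal`, `y_bal` instead of the original data. Duplicating rare rows adds no new information, which is one reason internal approaches such as the one described in this paper are also of interest.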
In the literature it is possible to find many works related to the classification of unbalanced datasets. The external approaches are investigated in depth in [6], where they are tested on a text analysis application, and in [7], where they are combined with an ensemble of Support Vector Machines (SVMs). In [8] such techniques are tested on many benchmark applications and in [9] they are applied to a radar image classification problem. Internal methods have been developed as well; among them, rectangular basis functions (RecBF) and fuzzy points have been investigated in depth in [10], [11], [12], [13], [14], in which the method based on rectangular basis functions is proposed and tested on some benchmark problems. Several approaches based on SVMs and their evolutions have also been proposed. In particular, in [15] multiple SVMs (one for each class) were proposed for unbalanced dataset classification, and in [16] a modified SVM is used for the same purpose. Both these works test the proposed methods on widely used machine learning problems.
The present paper describes an internal method, named LASCUS (LAbeled Som Classification Unbalanced Sets), based on the joint use of a particular kind of neural network, the self organizing map (SOM), and fuzzy logic to approach binary classification problems with markedly imbalanced datasets. The LASCUS method couples a data clustering operation with a subsequent labeling of the obtained clusters, creating a structure which can be used for classification purposes.
The main idea exploited by LASCUS is to create, by means of the SOM, a set of centroids for the input patterns and then to identify, among these centroids, those to which a relatively high density of samples belonging to classunf refers, on the basis of the resemblance between input pattern and cluster. With respect to other methods, the idea that should lead LASCUS to the detection of unfrequent events is that it identifies a cluster as representing classunf patterns on the basis not of the absolute rate of such situations in the neighborhood of the cluster but of the relative rate. This rate is thresholded, by means of a fuzzy inference system working according to a set of rules defined to facilitate the identification of a high percentage of unfrequent situations, in order to determine the sensitivity of the classifier to classunf patterns. The LASCUS training process can thus be subdivided into three successive steps: the first is devoted to the clustering of the input patterns, the second to the calculation of the classunf density for each obtained cluster, and the third to the choice of a suitable sensitivity threshold and the assignment of a binary label to each cluster according to that threshold. The first two steps of this process are described in detail in Section 2.1 while the third one is described in Section 2.2.
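The three steps can be sketched as follows. This is only a rough illustration under several assumptions, not the authors' implementation. The SOM is a small hand-written NumPy version with an assumed grid size and decay schedule. Each sample is assigned to its single nearest centroid, which simplifies the resemblance-based density. The sensitivity threshold is a fixed number passed by the caller, standing in for the fuzzy inference system, whose rules are not given in this section. All function names are hypothetical.

```python
import numpy as np

def train_som(X, rows=3, cols=3, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Step 1: cluster the inputs with a small SOM; returns the centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the centroids on randomly chosen training samples.
    weights = X[rng.choice(len(X), size=rows * cols, replace=True)].copy()
    n_steps, step = epochs * len(X), 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                    # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.1)    # assumed radius floor
            bmu = np.argmin(((weights - X[i]) ** 2).sum(axis=1))
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
            weights += lr * h[:, None] * (X[i] - weights)
            step += 1
    return weights

def assign(X, weights):
    """Index of the nearest centroid for every sample."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def label_clusters(X, y, weights, threshold):
    """Steps 2 and 3: relative classunf density per cluster, then binary labels.
    `threshold` is a fixed stand-in for the fuzzy inference system."""
    hits = assign(X, weights)
    k = len(weights)
    total = np.bincount(hits, minlength=k)
    rare = np.bincount(hits, weights=(np.asarray(y) == 1).astype(float), minlength=k)
    density = np.divide(rare, total, out=np.zeros(k), where=total > 0)
    return (density >= threshold).astype(int), density

def predict(X, weights, labels):
    """A new sample takes the label of its nearest centroid."""
    return labels[assign(X, weights)]

# Toy data (assumed): 300 classfre points and 15 classunf points.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0.0, 0.5, (300, 2)), rng.normal(3.0, 0.3, (15, 2))])
y_toy = np.array([0] * 300 + [1] * 15)
som = train_som(X_toy)
labels, density = label_clusters(X_toy, y_toy, som, threshold=0.3)
pred = predict(X_toy, som, labels)
print("detected classunf fraction:", pred[y_toy == 1].mean())
```

Because the label depends on the relative density inside each cluster, a cluster holding only a handful of classunf samples can still be labeled as classunf, even though those samples are a tiny share of the whole dataset. Lowering `threshold` makes the classifier more sensitive to classunf at the cost of more false alarms.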
The proposed method has been successfully applied in three practical cases, the first from the medical field and the other two from the industrial context. The Wisconsin breast cancer database [17] is one of the most widely used databases for evaluating machine learning algorithms. The database consists of a set of observations of features extracted from biopsy images and an additional field stating the nature of the examined tumor (malignant or benign). Clearly, in this problem it is fundamental to correctly identify malignant tumors. The so-called clogging prediction problem, which comes from the steel-making industry and is described in Section 3.2, perfectly fits this set of tricky classification problems: clogging is relatively rare (1% of situations) and affects both product quality and productivity. Detecting the occurrence of clogging is a crucial point of the casting process since, if identified, clogging can be avoided by taking proper countermeasures. Finally, a third classification problem is considered, also based on data from an industrial application, which deals with metal product quality prediction. This problem is not binary, because more than two classes need to be distinguished, but one of the classes is less frequent and more important than the others; in this case the proposed approach is successfully exploited in parallel with a traditional classifier.
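A small arithmetic check shows why overall accuracy is a poor guide when the rare class covers about 1% of the data, as in the clogging problem. The data below are hypothetical (1000 observations, 10 of them rare) and the helper name is an assumption. A classifier that never signals the rare class still reaches 99% accuracy while detecting none of the rare events.

```python
def accuracy_and_detection_rate(y_true, y_pred, rare_label=1):
    """Return (overall accuracy, fraction of rare-class samples detected)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    rare_total = sum(t == rare_label for t in y_true)
    rare_found = sum(t == rare_label and p == rare_label for t, p in pairs)
    accuracy = correct / len(y_true)
    detection = rare_found / rare_total if rare_total else float("nan")
    return accuracy, detection

# Hypothetical data: 1000 observations, 10 of them rare (1%).
y_true = [1] * 10 + [0] * 990
always_frequent = [0] * 1000  # classifier that ignores the rare class
acc, det = accuracy_and_detection_rate(y_true, always_frequent)
print(acc, det)  # 0.99 0.0
```

For this reason, results on such problems are better judged by the detection rate of the rare class together with the false alarm rate than by accuracy alone.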
The paper is organized as follows: the LASCUS method is described in detail in Section 2 while Section 3 discusses the application of LASCUS to the practical problems. A comparison is made between LASCUS and other classification algorithms by means of the obtained numerical results. Finally Section 4 provides several final considerations and concluding remarks.
The LASCUS method
Neural networks are very frequently and successfully applied to classification problems. However, when coping with clustering problems marked by a non-uniform class distribution, traditional algorithms such as feed-forward neural networks and Radial Basis Function (RBF) networks [18] specialize in classifying patterns belonging to classfre, to the detriment of classunf patterns; this is due to the fact that clusters corresponding to unfrequent classes are built on the basis
Practical use and results
The LASCUS method has been tested on a database which is commonly used for the validation of machine learning algorithms, the Wisconsin breast cancer database (WBCDB), provided by the UCI database repository. This problem and the results obtained on it by the method are described in Section 3.1. LASCUS has also been tested on several industrial problems. The results obtained by this method when used for clogging prediction are shown in Section 3.2 together with
Conclusions
In this paper a new classification method has been described. This method, called LASCUS, is based on the use of a self organizing map and on fuzzy logic and is particularly suitable for those classification problems characterized by uneven datasets and where it is important to correctly detect unfrequent events which, in some industrial or medical applications, can correspond to dangerous situations.
The proposed method has been tested on the Wisconsin breast cancer database which is widely
References (33)
- et al., Constructing Fuzzy graphs from examples, Intelligence Data Analysis (1999)
- et al., ARTMAP: Supervised real-time learning and classification of non stationary data by self-organising neural networks, Neural Networks (1991)
- et al., An experiment in linguistic synthesis with a fuzzy logic controller, International Journal of Man-Machine Studies (1975)
(1975) - K.L. Butler, J.A. Momoh, A neural net based approach for fault diagnosis in distribution networks, Power Engineering...
- G. Shreekant, Y. Bin, P. Meckl, Fault detection for nonlinear systems in presence of input unmodeled dynamics, Advanced...
- N. Stepenosky, R. Polikar, J. Kounios, C. Clark, Ensemble techniques with weighted combination rules for early...
- R. Haga, Y. Mitsukura, M. Fukumi, N. Akamatsu, M. Yasutomo, Automatic detection of left ventricular asynergy by fuzzy...
- A. Estabrooks, A combination scheme for inductive learning from imbalanced datasets, MCS thesis, Faculty of computer...
- et al., A multiple resampling method for learning from imbalanced data sets, Computational Intelligence (2004)
- Y. Liu, A. Anand, X. Huang, Boosting prediction accuracy on imbalanced datasets with SVM ensembles, Lecture Notes in...
- Learning from imbalanced datasets with boosting and data generation: the databoost approach, SIGKDD Explorations
- Machine learning for the detection of oil spills in satellite radar images, Machine Learning
- Introduction of mixed fuzzy rules, International Journal of Fuzzy Systems
- Adapting Fuzzy points for very-imbalanced datasets