A niching genetic programming-based multi-objective algorithm for hybrid data classification
Introduction
Rule mining is one of the steps of knowledge discovery in databases (KDD) [15]. It has been studied in a wide range of practical applications in which such information can be useful. The objective of rule mining is to identify rules that can be used for classifying or identifying data. Several approaches have been employed in the literature for extracting non-trivial knowledge from databases. Among them are learning algorithms [11] for classification rule building, such as k-nearest neighbors (KNN) [47], [46], [21] and Bayesian classifiers [3], [6], [38], [40], as well as other supervised algorithms, like decision trees (DT) [8], [40], [22], artificial neural networks (ANN) [26], [2], [29], [19], [20] and support vector machines (SVM) [42], [9], [27], [30]. Furthermore, there are works that propose evolutionary-based tools for handling classification problems [16], such as genetic algorithms (GA) [35], genetic programming (GP) [4], artificial immune systems [1], ant colony algorithms [34] and particle swarm optimization.
Algorithms that are capable of directly handling hybrid databases, without data preprocessing, have not been found in the literature. A database is said to be hybrid when it is composed of conventional attributes (e.g. numerical, textual, and logical) and unconventional attributes (e.g. geographical). Usually, the algorithms that deal with hybrid data adopt some alternative structure for representing the unconventional attributes as conventional ones. For instance, the works [49], [25], [48], which deal with geographic data, employ particular schemes for representing the geographic entities. In general, these alternative representations are not desirable, since the performance of the classification algorithm and the interpretation of the results become highly dependent on the representation scheme adopted. In fact, many geographic databases use a standard model to represent and store the attributes [12], [13]. These databases use functions that are well known [12], but only a few classification algorithms exploit this rich set of functions [5]. Moreover, the existing methods require data to be preprocessed, and they are not able to mix conventional and unconventional functions in the same classification rule.
Genetic programming is a class of optimization algorithms that generates a population of individuals and evolves them along a generational process [24]. It employs genetic operators inspired by nature, like the ones employed in traditional GAs (selection, crossover, mutation, etc.), but GP and GAs differ in the structure of their individuals. If a proper evaluation mechanism is adopted, genetic programming becomes able to deal with complex classification problems, including the possibility of handling multiple objectives simultaneously.
This paper proposes a multi-objective genetic programming algorithm that has been specially designed to classify conventional and hybrid data in real-world databases. An improved niche strategy [18] and a population archive are used to increase the ability to extract suitable rules for all problem classes present in the database. The algorithm is also very flexible: the user can choose the function set that should be employed to build the rules, regardless of the attribute types (numerical, textual, geographical, etc.). This characteristic makes the proposed approach adaptable to virtually any kind of problem, without structural changes in the algorithm. A preliminary version of the proposed algorithm can be found in [36], in which the flexibility of the algorithm is illustrated in spatial data mining problems.
The efficiency of the algorithm is measured using the following data sets, which are detailed in Section 3:
- Hepatitis and wine, both from the UCI machine learning repository. These data sets are composed of regular attributes (numerical and textual), and the instances are classified into two and three classes, respectively.
- Dissolved gas analysis (DGA) problem, where three real data sets of power transformers are classified into three classes [35].
- City Development, in which cities are classified into three classes. This data set is composed of conventional (categorical/numerical) and geographical (points, lines and polygons) attributes.
The algorithm proposed here is employed to maximize classification effectiveness and to minimize rule size. It has been conceived to handle a wide range of real world classification problems. Furthermore, the characteristics of the algorithm suggest that it is well suited for dealing with unbalanced data sets (i.e., data sets in which the number of samples in some class is considerably higher than in the others). Problems with unbalanced data sets are notoriously harder to solve [33], especially when more than two classes are involved [31].
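A common way to compare rules under two competing objectives such as these is Pareto dominance. The excerpt does not state the exact comparison mechanism used by the algorithm, so the following is only a minimal sketch under that assumption; the `Rule` type, its field names, and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical summary of a rule: only its two objective values."""
    name: str
    effectiveness: float  # to be maximized
    size: int             # to be minimized


def dominates(a: Rule, b: Rule) -> bool:
    """True if `a` is no worse than `b` in both objectives and better in one."""
    no_worse = a.effectiveness >= b.effectiveness and a.size <= b.size
    better = a.effectiveness > b.effectiveness or a.size < b.size
    return no_worse and better


def non_dominated(rules: List[Rule]) -> List[Rule]:
    """Keep the rules that no other rule dominates (the Pareto front)."""
    return [r for r in rules if not any(dominates(o, r) for o in rules)]


# Illustrative values only; they are not results from the paper.
rules = [
    Rule("r1", 0.90, 7),
    Rule("r2", 0.85, 3),
    Rule("r3", 0.80, 5),
    Rule("r4", 0.90, 9),
]
front = non_dominated(rules)
print([r.name for r in front])
```

In this toy set, `r3` is dominated by `r2` (less effective and larger) and `r4` by `r1` (equally effective but larger), so only the two trade-off rules remain.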
Finally, it is important to emphasize that all problems considered in this work have been modeled in such a way that other classical algorithms could be used and their performances could be assessed to establish a fair comparison with the proposed algorithm. For that purpose, three different classification assessment strategies to select and to test the obtained rules have been proposed.
This paper is structured as follows. The proposed algorithm is described in Section 2. Results achieved by the proposed algorithm for the considered problems are presented in Section 3, along with comparisons to the results of three classical approaches. Finally, Section 4 presents the conclusions that could be drawn from the study.
Algorithm modeling
A candidate solution, or individual, in our MOGP (multi-objective genetic programming) algorithm is represented by a Boolean predicate, defined in the same manner as the WHERE clause of the SELECT statement in the structured query language (SQL) [14]. The fitness of an individual for the problem is assessed by a vector of objective functions, which determines its selection probability.
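To make the representation concrete, the sketch below models an individual as a small Boolean expression tree that can be rendered as a WHERE-style clause and scored with a two-element objective vector. It is an illustration under assumptions, not the authors' implementation: the class names, the attribute names (`alcohol`, `color`), the use of plain accuracy as the effectiveness measure, and the node count as the size measure are all hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple, Union

Record = Dict[str, Any]
OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


@dataclass(frozen=True)
class Comparison:
    """Leaf node: attribute <op> constant."""
    attribute: str
    op: str
    value: Any

    def evaluate(self, record: Record) -> bool:
        return OPS[self.op](record[self.attribute], self.value)

    def size(self) -> int:
        return 1

    def to_sql(self) -> str:
        return f"{self.attribute} {self.op} {self.value!r}"


@dataclass(frozen=True)
class Logical:
    """Internal node: left AND/OR right."""
    op: str
    left: "Predicate"
    right: "Predicate"

    def evaluate(self, record: Record) -> bool:
        if self.op == "AND":
            return self.left.evaluate(record) and self.right.evaluate(record)
        return self.left.evaluate(record) or self.right.evaluate(record)

    def size(self) -> int:
        return 1 + self.left.size() + self.right.size()

    def to_sql(self) -> str:
        return f"({self.left.to_sql()} {self.op} {self.right.to_sql()})"


Predicate = Union[Comparison, Logical]


def objectives(rule: Predicate, records: List[Record],
               labels: List[str], target: str) -> Tuple[float, int]:
    """Assumed objective vector: (accuracy for `target`, number of nodes)."""
    hits = sum(rule.evaluate(r) == (y == target)
               for r, y in zip(records, labels))
    return hits / len(records), rule.size()


# Toy data; attribute names and values are made up for illustration.
records = [
    {"alcohol": 13.2, "color": "red"},
    {"alcohol": 12.1, "color": "red"},
    {"alcohol": 13.8, "color": "white"},
    {"alcohol": 11.5, "color": "white"},
]
labels = ["A", "B", "A", "B"]
rule = Logical("AND",
               Comparison("alcohol", ">", 13.0),
               Comparison("color", "=", "red"))
print(rule.to_sql())
print(objectives(rule, records, labels, target="A"))
```

Because the predicate is a tree, geographic functions could in principle be added as further node types next to `Comparison` without changing the evaluation loop, which is one plausible reading of the flexibility described in the text.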
The classification problem is modeled as a bi-objective optimization problem, in which the objectives are (1)
Parameters configuration
The proposed algorithm has been set with the following parameters:
- Mutation rate: 2%.
- Crossover rate: 80%.
- Elitism per niche: 10% of the best individuals.
- Population archive size: 100 individuals.
The population size and the number of generations used in each problem are given in Table 2.
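The listed settings can be gathered in a small configuration object, as in the hedged sketch below. The class and function names are hypothetical, the population size and number of generations are left out because they are problem-specific (Table 2), and the way the 10% elitism is rounded and applied within a niche is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MOGPParams:
    """Parameter values reported in the text; the field names are hypothetical."""
    mutation_rate: float = 0.02
    crossover_rate: float = 0.80
    elitism_fraction: float = 0.10  # applied per niche
    archive_size: int = 100


def select_elite(niche: List[Tuple[float, str]],
                 params: MOGPParams) -> List[Tuple[float, str]]:
    """Return the best individuals of one niche.

    `niche` holds (score, individual) pairs where a higher score is better.
    Keeping at least one individual and rounding to the nearest integer
    are assumptions of this sketch.
    """
    k = max(1, round(params.elitism_fraction * len(niche)))
    return sorted(niche, key=lambda pair: pair[0], reverse=True)[:k]


params = MOGPParams()
# Toy niche of 20 individuals with made-up scores.
niche = [(i / 20, f"rule{i}") for i in range(20)]
top = select_elite(niche, params)
print([name for _, name in top])
```

With 20 individuals in the niche, 10% elitism keeps the two best-scoring ones.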
These values were defined based on preliminary tests. Several parameter sets were tested during the implementation of the algorithm to identify the configurations that were able to provide reasonable results without
Conclusion
In this paper, a new algorithm that can extract hybrid rules with three different classification approaches is proposed. The algorithm employs a niche technique, mutation and crossover probabilities per niche and an archive population to generate individuals that provide good classification of data from all classes of a given problem. The classification problem has been modeled with two objectives, effectiveness and complexity, in order to obtain rules that correctly classify data and that
Acknowledgments
This work has been supported in part by CNPq, CAPES and FAPEMIG, Brazilian agencies in charge of fostering scientific and technological development.
References (50)
- et al., Democracy in neural nets: voting schemes for classification, Neural Netw. (1994)
- et al., GP-COACH: genetic programming-based learning of COmpact and ACcurate fuzzy rule-based classification systems for high-dimensional problems, Inf. Sci. (NY) (2010)
- et al., Using Bayesian networks with rule extraction to infer the risk of weed infestation in a corn-crop, Eng. Appl. Artif. Intell. (2009)
- et al., Using decision trees to summarize associative classification rules, Expert Syst. Appl. (2009)
- et al., Extraction of similarity based fuzzy rules from artificial neural networks, Int. J. Approx. Reason. (2006)
- et al., Comprehensible credit scoring models using rule extraction from support vector machines, Eur. J. Oper. Res. (2007)
- et al., Troika – an improved stacking schema for classification tasks, Inf. Sci. (NY) (2009)
- On detecting nonlinear patterns in discriminant problems, Inf. Sci. (NY) (2006)
- et al., Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence, Pattern Recognit. (2006)
- et al., An order-clique-based approach for mining maximal co-locations, Inf. Sci. (NY) (2009)
- Induction of a marsupial density model using genetic programming and spatial relationships, Ecol. Modell.
- Validation of species-climate impact models under climate change, Global Change Biol.
- A multicriteria statistical based comparison methodology for evaluating evolutionary algorithms, IEEE Trans. Evolut. Comput.
- Support vector learning for fuzzy rule-based classification systems, IEEE Trans. Fuzzy Syst.
- Empirical Methods for Artificial Intelligence
- ENDER: a statistical framework for boosting decision rules, Data Min. Knowl. Discov.
- A model for detailed binary topological relationships, Geomatica
- Spatial data mining: database primitives, algorithms and efficient DBMS support, Data Min. Knowl. Discov.
- Fundamentals of Database Systems
- Advances in Knowledge Discovery and Data Mining
- Data Mining and Knowledge Discovery with Evolutionary Algorithms, Natural Computing Series
- Genetic Algorithms in Search, Optimization and Machine Learning
Marconi de Arruda Pereira received the Ph.D. degree in electrical engineering from the Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil, in 2012. From March 2008 to February 2013, he was a professor at the Federal Center of Technological Education of Minas Gerais (CEFET–MG), Belo Horizonte. Since March 2013, he has been an assistant professor at the Federal University of São João del Rei and a master's student supervisor at the Federal Center of Technological Education of Minas Gerais. His research interests include classification systems and intelligent systems.
Clodoveu Augusto Davis Junior received his B.S. degree in Civil Engineering in 1985 from the Federal University of Minas Gerais (UFMG), Brazil. He obtained M.Sc. and Ph.D. degrees in Computer Science, also from UFMG, in 1992 and 2000, respectively. He led the team that conducted the implementation of GIS technology in the city of Belo Horizonte, Brazil, and coordinated several geographic application development efforts. Currently, he is a professor and researcher at the Federal University of Minas Gerais. His main research interests include spatial data infrastructures, geographic databases, urban GIS, and multiple representations in GIS.
Eduardo Gontijo Carrano received the Ph.D. degree in electrical engineering from the Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil, in 2007. From May 2006 to May 2007, he was a visiting student with the Center for Intelligent Systems, Universidade do Algarve, Faro, Portugal. From September 2007 to January 2008, he was a post-doctoral researcher with the Department of Mathematics, UFMG. From January 2008 to June 2011, he was an assistant professor with Centro Federal de Educação Tecnológica de Minas Gerais (CEFET–MG), Belo Horizonte. Since June 2011, he has been an assistant professor with the Department of Electrical Engineering, UFMG. His current research interests include network design, combinatorial optimization, evolutionary algorithms, multiobjective optimization, optimization theory, algorithm evaluation, and dynamic systems.
João A. Vasconcelos received the B.Sc. degree in electrical engineering from the Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil, in 1982; the M.Sc. degree in electrical engineering from Universidade Federal da Paraíba in 1985; and the Ph.D. degree in electrical engineering from the École Centrale de Lyon, France, in 1994. Currently, he is a senior professor at the Electrical Engineering Department in UFMG. Most of his research work includes single and multiobjective evolutionary computation (optimization on design, maintenance and operation of systems), computational electromagnetics (boundary element method, finite element method, moment method) and decision making.