Elsevier

Neurocomputing

Volume 133, 10 June 2014, Pages 342-357

A niching genetic programming-based multi-objective algorithm for hybrid data classification

https://doi.org/10.1016/j.neucom.2013.12.048

Abstract

This paper introduces a multi-objective algorithm based on genetic programming to extract classification rules from databases composed of hybrid data, i.e., regular (e.g. numerical, logical, and textual) and non-regular (e.g. geographical) attributes. The algorithm employs a niche technique combined with a population archive in order to identify the rules that are most suitable for classifying items amongst the classes of a given data set. The algorithm is implemented in such a way that the user can choose the function set that is most adequate for a given application. This feature makes the proposed approach applicable to virtually any kind of data set classification problem. In addition, the classification problem is modeled as a multi-objective one, in which the maximization of the accuracy and the minimization of the classifier complexity are the objective functions. A set of classification problems, with considerably different data sets and domains, has been considered: wines, patients with hepatitis, incipient faults in power transformers and level of development of cities. In this last data set, some of the attributes are geographical, and they are expressed as points, lines or polygons. The effectiveness of the algorithm has been compared with three other methods widely employed for classification: Decision Tree (C4.5), Support Vector Machine (SVM) and Radial Basis Function (RBF). Statistical comparisons have been conducted employing one-way ANOVA and Tukey’s tests, in order to provide a reliable comparison of the methods. The results show that the proposed algorithm achieved better classification effectiveness in all tested instances, which suggests that it is suitable for a considerable range of classification applications.

Introduction

Rule mining is one of the steps for knowledge discovery in databases (KDD) [15]. It has been studied in a wide range of practical applications in which such information can be useful. The objective of rule mining is to identify rules that can be used for classifying or identifying data. In the literature, several approaches have been employed for extracting non-trivial knowledge from databases. Amongst these approaches, it is possible to cite unsupervised learning algorithms [11] for classification rule building, such as k-nearest neighbors (KNN) [47], [46], [21] and Bayesian classifiers [3], [6], [38], [40], and also supervised algorithms, like decision trees (DT) [8], [40], [22], artificial neural networks (ANN) [26], [2], [29], [19], [20] and support vector machines (SVM) [42], [9], [27], [30]. Furthermore, there are works that propose evolutionary-based tools for handling classification problems [16], such as genetic algorithms (GA) [35], genetic programming (GP) [4], artificial immune systems [1], ant colony algorithms [34] and particle swarm optimization.

Algorithms that are capable of directly handling hybrid databases, without data preprocessing, have not been found in the literature. A database is said to be hybrid when it is composed of conventional attributes (e.g. numerical, textual, and logical) and unconventional attributes (e.g. geographical). Usually, the algorithms that deal with hybrid data adopt some alternative structure for representing the unconventional attributes as conventional ones. For instance, the works [49], [25], [48], which deal with geographic data, employ particular schemes for representing the geographic entities. In general, these alternative representations are not desirable, since the performance of the classification algorithm and the interpretation of the results become highly dependent on the representation scheme adopted. In fact, there are many geographic databases that use a pattern model to represent and store the attributes [12], [13]. These databases use functions that are well known [12], but just a few classification algorithms explore this rich set of functions [5]. Moreover, the existing methods require data to be preprocessed, and they are not able to mix conventional and unconventional functions in the same classification rule.

Genetic programming is a class of optimization algorithms that generates a population of individuals and evolves them along a generational process [24]. It employs genetic operators inspired by nature, like the ones employed in traditional GAs (selection, crossover, mutation, etc.), but the two approaches differ in their structure. If a proper evaluation mechanism is adopted, genetic programming becomes able to deal with complex classification problems, including the possibility of handling multiple objectives simultaneously.

This paper proposes a multi-objective genetic programming algorithm that has been specially designed to classify conventional and hybrid data in real world databases. An improved niche strategy [18] and a population archive are used to increase the ability of extracting suitable rules for all problem classes present in the database. The algorithm is also very flexible, in that the user can choose the function set that should be employed to build the rules, regardless of the attribute types (numerical, textual, geographical, etc.). This characteristic makes the proposed approach adaptable to virtually any kind of problem, without structural changes in the algorithm. A preliminary version of the proposed algorithm can be found in [36], in which the flexibility of the algorithm is illustrated in spatial data mining problems.

The efficiency of the algorithm is measured using the following data sets that are detailed in Section 3:

  • Hepatitis and wine, both from the UCI machine learning repository. These data sets are composed of regular attributes (numerical and textual) and the instances are classified into two and three classes, respectively.

  • Dissolved gas analysis (DGA) problem, where three real data sets of power transformers are classified into three classes [35].

  • City Development, in which cities are classified into three classes. This data set is composed of conventional (categorical/numerical) and geographical (points, lines and polygons) attributes.

The algorithm proposed here is employed to maximize classification effectiveness and to minimize rule size. It has been conceived to handle a wide range of real world classification problems. Furthermore, the characteristics of the algorithm suggest that it is well suited for dealing with unbalanced data sets (i.e., data sets in which the number of samples in some class is considerably higher than in the others). Problems with unbalanced data sets are notoriously harder to solve [33], especially when more than two classes are involved [31].
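The trade-off between the two objectives stated above can be illustrated with a Pareto dominance check. The following is only a minimal sketch: the `Scored` tuple, the rule names and the numeric values are hypothetical and are not taken from the paper, and the selection scheme actually used by the authors may differ.

```python
from typing import List, NamedTuple


class Scored(NamedTuple):
    """A candidate rule with its two objective values (hypothetical layout)."""
    name: str
    accuracy: float  # objective 1: to be maximized
    size: int        # objective 2: to be minimized (e.g. number of rule nodes)


def dominates(a: Scored, b: Scored) -> bool:
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = a.accuracy >= b.accuracy and a.size <= b.size
    better = a.accuracy > b.accuracy or a.size < b.size
    return no_worse and better


def nondominated(cands: List[Scored]) -> List[Scored]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Made-up values, for illustration only.
candidates = [
    Scored("r1", 0.90, 7),
    Scored("r2", 0.85, 3),
    Scored("r3", 0.85, 5),  # same accuracy as r2 but larger
    Scored("r4", 0.70, 3),  # same size as r2 but less accurate
]
front = nondominated(candidates)
print([c.name for c in front])
```

Here `r1` and `r2` are both kept because neither is better in both objectives: `r1` is more accurate, while `r2` is smaller.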

Finally, it is important to emphasize that all problems considered in this work have been modeled in such a way that other classical algorithms could be used and their performances could be assessed to establish a fair comparison with the proposed algorithm. For that purpose, three different classification assessment strategies to select and to test the obtained rules have been proposed.

This paper is structured as follows. The proposed algorithm is described in Section 2. Results achieved by the proposed algorithm for the considered problems are presented in Section 3, along with comparisons to the results of three classical approaches. Finally, Section 4 presents the conclusions that could be drawn from the study.

Algorithm modeling

A candidate solution, or individual, in our MOGP (multi-objective genetic programming) algorithm is represented by a Boolean predicate, defined in the same manner as the WHERE clause of a SELECT statement in the structured query language (SQL) [14]. The adequacy of the individual to the problem is assessed by using a vector of objective functions, which determines its selection probability.
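To make this representation concrete, the sketch below models an individual as a small Boolean expression tree that can be evaluated against a record or printed as a WHERE-style predicate. The node classes (`Cmp`, `And`, `Or`, `Not`), the attribute names and the thresholds are illustrative assumptions rather than the authors' implementation, and counting tree nodes is only one plausible way to measure rule size.

```python
from dataclasses import dataclass
import operator
from typing import Any, Callable, Dict, Union

Record = Dict[str, Any]

# Comparison operators allowed in leaves (a hypothetical function set).
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt,
    ">=": operator.ge, "=": operator.eq, "<>": operator.ne,
}


@dataclass(frozen=True)
class Cmp:
    """Leaf node: compares one attribute of the record with a constant."""
    attr: str
    op: str
    value: Any

    def matches(self, rec: Record) -> bool:
        return OPS[self.op](rec[self.attr], self.value)

    def size(self) -> int:
        return 1

    def to_sql(self) -> str:
        return f"{self.attr} {self.op} {self.value!r}"


@dataclass(frozen=True)
class And:
    left: "Node"
    right: "Node"

    def matches(self, rec: Record) -> bool:
        return self.left.matches(rec) and self.right.matches(rec)

    def size(self) -> int:
        return 1 + self.left.size() + self.right.size()

    def to_sql(self) -> str:
        return f"({self.left.to_sql()} AND {self.right.to_sql()})"


@dataclass(frozen=True)
class Or:
    left: "Node"
    right: "Node"

    def matches(self, rec: Record) -> bool:
        return self.left.matches(rec) or self.right.matches(rec)

    def size(self) -> int:
        return 1 + self.left.size() + self.right.size()

    def to_sql(self) -> str:
        return f"({self.left.to_sql()} OR {self.right.to_sql()})"


@dataclass(frozen=True)
class Not:
    child: "Node"

    def matches(self, rec: Record) -> bool:
        return not self.child.matches(rec)

    def size(self) -> int:
        return 1 + self.child.size()

    def to_sql(self) -> str:
        return f"NOT ({self.child.to_sql()})"


Node = Union[Cmp, And, Or, Not]

# Made-up rule and records, for illustration only.
rule = And(Cmp("alcohol", ">", 13.0), Not(Cmp("hue", "<=", 0.8)))
records = [
    {"alcohol": 13.5, "hue": 1.0},
    {"alcohol": 12.0, "hue": 1.0},
    {"alcohol": 14.1, "hue": 0.7},
]
selected = [r for r in records if rule.matches(r)]
print(rule.to_sql(), rule.size(), selected)
```

Under these assumptions, a geographical function (for example a topological test between two geometries) would simply be another leaf type returning a Boolean, which is how conventional and unconventional attributes could be mixed in a single rule.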

The classification problem is modeled as a bi-objective optimization problem, in which the objectives are (1)

Parameters configuration

The proposed algorithm has been set with the following parameters:

  • Mutation rate: 2%.

  • Crossover rate: 80%.

  • Elitism per niche: 10% of the best individuals.

  • Population archive size: 100 individuals.

The population size and the number of generations used in each problem are given in Table 2.
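For readers who want to reproduce such a setup, the parameters listed above could be grouped in a single configuration object, as in the sketch below. The class name `MOGPConfig`, its field names and the helper `elites_in` are assumptions made for illustration; the population size and number of generations are problem dependent (Table 2), so the values passed in the usage lines are placeholders only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MOGPConfig:
    """Hypothetical container for the parameters listed in the text."""
    # Problem-dependent values (given per problem in Table 2): no defaults.
    population_size: int
    generations: int
    # Fixed values reported in the text.
    mutation_rate: float = 0.02      # 2%
    crossover_rate: float = 0.80     # 80%
    elitism_per_niche: float = 0.10  # 10% of the best individuals per niche
    archive_size: int = 100          # individuals kept in the archive

    def __post_init__(self) -> None:
        # Basic sanity checks on the rates.
        for name in ("mutation_rate", "crossover_rate", "elitism_per_niche"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def elites_in(self, niche_size: int) -> int:
        """Number of individuals preserved by elitism in a niche (assumed rounding)."""
        return round(niche_size * self.elitism_per_niche)


# Placeholder values, NOT the ones from Table 2.
cfg = MOGPConfig(population_size=200, generations=50)
print(cfg.elites_in(50))
```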

These values were defined based on preliminary tests. Several parameter sets were tested during the implementation of the algorithm to identify the configurations that were able to provide reasonable results without

Conclusion

In this paper, a new algorithm that can extract hybrid rules with three different classification approaches is proposed. The algorithm employs a niche technique, mutation and crossover probabilities per niche and an archive population to generate individuals that provide good classification of data from all classes of a given problem. The classification problem has been modeled with two objectives, effectiveness and complexity, in order to obtain rules that correctly classify data and that

Acknowledgments

This work has been supported in part by CNPq, CAPES and FAPEMIG, Brazilian agencies in charge of fostering scientific and technological development.

References (50)

  • P.A. Whigham, Induction of a marsupial density model using genetic programming and spatial relationships, Ecol. Modell. (2000)
  • R.T. Alves, M.R. Delgado, H.S. Lopes, A.A. Freitas, An artificial immune system for fuzzy-rule induction in data...
  • M.B. Araújo et al., Validation of species-climate impact models under climate change, Global Change Biol. (2005)
  • V. Bogorny, A.T. Palma, P.M. Engel, L.O. Alvares, Weka-GDPM – integrating classical data mining toolkit to geographic...
  • E.G. Carrano et al., A multicriteria statistical based comparison methodology for evaluating evolutionary algorithms, IEEE Trans. Evolut. Comput. (2011)
  • Y. Chen et al., Support vector learning for fuzzy rule-based classification systems, IEEE Trans. Fuzzy Syst. (2003)
  • P.R. Cohen, Empirical Methods for Artificial Intelligence (1995)
  • K. Dembczyński et al., ENDER: a statistical framework for boosting decision rules, Data Min. Knowl. Discov. (2010)
  • M.A. Egenhofer, A model for detailed binary topological relationships, Geomatica (1993)
  • M. Ester et al., Spatial data mining: database primitives, algorithms and efficient DBMS support, Data Min. Knowl. Discov. (2000)
  • R. Elmasri et al., Fundamentals of Database Systems (2003)
  • U.M. Fayyad et al., Advances in Knowledge Discovery and Data Mining (1996)
  • A.A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms, Natural Computing Series (2002)
  • GeoMINAS – Programa Integrado de Uso da Tecnologia de Geoprocessamento pelos Órgãos do Estado de Minas Gerais....
  • D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (1989)
    Marconi de Arruda Pereira received the Ph.D. degree in electrical engineering from the Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil, in 2012. From March 2008 to February 2013, he was a professor at the Federal Center of Technological Education of Minas Gerais (CEFET–MG), Belo Horizonte. Since March 2013, he has been an assistant professor at the Federal University of São João del Rei and a master student supervisor at the Federal Center of Technological Education of Minas Gerais. His research interests include classification systems and intelligent systems.

    Clodoveu Augusto Davis Junior received his B.S. degree in Civil Engineering in 1985 from the Federal University of Minas Gerais (UFMG), Brazil. He obtained M.Sc. and Ph.D. degrees in Computer Science, also from UFMG, in 1992 and 2000, respectively. He led the team that conducted the implementation of GIS technology in the city of Belo Horizonte, Brazil, and coordinated several geographic application development efforts. Currently, he is a professor and researcher at the Federal University of Minas Gerais. His main research interests include spatial data infrastructures, geographic databases, urban GIS, and multiple representations in GIS.

    Eduardo Gontijo Carrano received the Ph.D. degree in electrical engineering from the Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil, in 2007. From May 2006 to May 2007, he was a visiting student with the Center for Intelligent Systems, Universidade do Algarve, Faro, Portugal. From September 2007 to January 2008, he was a post-doctoral researcher with the Department of Mathematics, UFMG. From January 2008 to June 2011, he was an assistant professor with Centro Federal de Educação Tecnológica de Minas Gerais (CEFET–MG), Belo Horizonte. Since June 2011, he has been an assistant professor with the Department of Electrical Engineering, UFMG. His current research interests include network design, combinatorial optimization, evolutionary algorithms, multiobjective optimization, optimization theory, algorithm evaluation, and dynamic systems.

    João A. Vasconcelos received the B.Sc. degree in electrical engineering from the Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil, in 1982; the M.Sc. degree in electrical engineering from Universidade Federal da Paraíba in 1985; and the Ph.D. degree in electrical engineering from the École Centrale de Lyon, France, in 1994. Currently, he is a senior professor at the Electrical Engineering Department in UFMG. Most of his research work includes single and multiobjective evolutionary computation (optimization on design, maintenance and operation of systems), computational electromagnetics (boundary element method, finite element method, moment method) and decision making.