
1 Introduction

Granular computing (GC) [16] is a new paradigm in information processing; the term was coined to label a subset of Zadeh's granular mathematics [17]. Among several Artificial Intelligence (AI) strategies, GC is used for knowledge discovery. When working with data instead of knowledge, it is hard to operate on the least amount of information possible without degrading the quality of the solution. A clear example appears in problem solving via rough sets, where it is of interest to define the simplest possible equivalence relation, that is, the one built from the minimum number of attributes that preserves the quality of the resulting approximation sets.

Rough Set Theory (RST) has proven effective for developing machine learning techniques [9, 10]. RST approaches based on multi-granulation (MG) start from the existence of different granulations determined by the relationships \( A_1, A_2, \ldots, A_m \subseteq A_T \), which provide different perspectives of the data; here \( A_T \) is the set of features, and the \( A_i \) are subsets of features called contexts. In the papers we reviewed [6, 9] on the combination of RST and MG (RST+MG), the authors introduced the contexts \( A_i \) without any clear explanation of how they were determined. In the less common approach, each \( A_i \) is identified by experts in the application domain; the most widely used option, and the one we adopt in this research, is to build the contexts automatically. This variant is especially appropriate in domains with many predictive features, and in those where the contexts are not evident. This MG-based RST approach is similar to multi-view learning [15].

Unlike single-view learning, multi-view learning introduces a function to model each particular view and optimizes all functions jointly, exploiting redundant views of the input data. In such a configuration, each view may contain knowledge that the other views lack, so multiple views can describe the data exhaustively and accurately [12]. A review of the multi-view learning literature shows that the topic is closely related to other areas of machine learning, such as active learning and ensemble learning. Ensemble learning can be briefly described as the use of multiple learning models whose predictions are combined [5, 12]. In addition, co-training is one of the oldest schemes for multi-view learning [12].

In this research, we propose a method to construct each \( A_i \), which can be seen as a multi-view. It uses a genetic algorithm (GA) and a measure of dependence between features; an ensemble algorithm similar to co-training is then applied. However, our method differs from co-training in that it does not use any information provided by previous classifiers. In our ensemble algorithm, after obtaining the models of the multiple views separately, an average probability vote is used to perform classification, which is the specific machine learning task we deal with. Classification aims at inferring a function \( f: P \to Y \) from labeled training data \( \{(p_1, y_1), \ldots, (p_m, y_m)\} \), where \( p_i \) is a vector of values (the input) and \( y_i \) is a class value (the output).
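As a minimal illustration of the average probability vote (the data and names below are hypothetical, not taken from the paper), the following Python sketch averages the class-probability outputs of several per-view models and predicts the class with the highest mean probability:

```python
import numpy as np

def soft_vote(prob_matrices):
    """Average probability vote: mean the per-model class probabilities,
    then predict the class with the highest averaged probability."""
    return np.mean(prob_matrices, axis=0).argmax(axis=1)

# Hypothetical predict_proba outputs of three view models (2 samples, 2 classes).
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.3, 0.7]])
p3 = np.array([[0.7, 0.3], [0.5, 0.5]])
print(soft_vote([p1, p2, p3]))  # -> [0 1]
```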

The remainder of this paper is organized as follows. Section 2 elaborates on the computational methodology. Section 3 introduces our method, Genetic Algorithm and Rough Sets-based Multi-Granulation (GA-RS-MG). Section 4 presents the experimental framework and results, while conclusions and future work remarks are given in Sect. 5.

2 Computational Methodology

To assess the resulting contexts, we built an ensemble algorithm based on the MLP classifier. It reveals how much the generation of contexts benefits machine learning methods.

2.1 Rough Set Theory

RST is an efficient tool for data mining, suitable for discovering dependencies in data, discovering patterns, estimating the significance of data, reducing data, and so forth [2, 4, 13]. In particular, it has been applied remarkably in the field of medicine [9]. RST aims at approximating any concept X ⊆ U (a subset of the domain universe) by a pair of exact sets, the lower and upper approximations. The lower approximation \( B_{*}(X) \) of a set X is defined as the collection of cases (objects of the universe U) whose equivalence classes [2, 13] are totally contained in X. The upper approximation \( B^{*}(X) \) contains those objects of U belonging to equivalence classes, generated by the indiscernibility relation, that contain at least one object x belonging to X. They are formally described as follows:

$$ B_{*}(X) = \left\{ x \in U \mid B(x) \subseteq X \right\} $$
(1)
$$ B^{*}(X) = \left\{ x \in U \mid B(x) \cap X \neq \emptyset \right\} $$
(2)
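A minimal Python sketch of Eqs. (1) and (2), assuming a toy decision table where each object is a dictionary of discrete attribute values (all names and data are illustrative):

```python
from collections import defaultdict

def blocks(universe, attrs):
    """Equivalence classes of the indiscernibility relation on attrs."""
    part = defaultdict(set)
    for i, obj in enumerate(universe):
        part[tuple(obj[a] for a in attrs)].add(i)
    return list(part.values())

def lower_upper(universe, attrs, X):
    """B_*(X): classes fully inside X; B^*(X): classes intersecting X."""
    lower, upper = set(), set()
    for block in blocks(universe, attrs):
        if block <= X:
            lower |= block
        if block & X:
            upper |= block
    return lower, upper

U = [{"a": 0, "b": 1}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
X = {0, 2}  # a target concept, given as a set of object indices
print(lower_upper(U, ["a", "b"], X))  # -> ({2}, {0, 1, 2})
```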

Classic RST works with discrete data, defining the indiscernibility between objects through strict equality of values. When data have features with continuous domains, these are discretized in order to obtain the degree of dependence through equivalence relations; otherwise, it would be necessary to use similarity relations.

Dependence Between Contexts and Decision Features.

Discovering dependencies between attributes is a key issue in data analysis [2]. Let B and D be subsets of the set of attributes A in the information system (U, A). \( B \Rightarrow_{\sigma} D \) denotes that D depends on B in a degree σ (0 ≤ σ ≤ 1). D depends partially on B when σ < 1, whereas \( B \Rightarrow D \) if the degree of dependence σ = 1, i.e., D depends totally on B, which happens when all values of the features in D are uniquely determined by the values of the features in B.

$$ \sigma(B, D) = \frac{\left| POS_{B}(D) \right|}{\left| U \right|} $$
(3)
$$ \text{where}\quad POS_{B}(D) = \bigcup\nolimits_{X \in U/D} B_{*}(X) $$
(4)
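Building on the previous sketch (it assumes the blocks and lower_upper helpers defined there are in scope), Eqs. (3) and (4) can be computed as follows:

```python
def dependence_degree(universe, B, D):
    """sigma(B, D) = |POS_B(D)| / |U|, where POS_B(D) is the union of the
    B-lower approximations of the decision classes U/D (Eqs. 3-4)."""
    pos = set()
    for X in blocks(universe, D):              # the partition U/D
        lower, _ = lower_upper(universe, B, X)
        pos |= lower
    return len(pos) / len(universe)

U = [{"a": 0, "d": 0}, {"a": 0, "d": 0}, {"a": 1, "d": 1}, {"a": 1, "d": 0}]
print(dependence_degree(U, ["a"], ["d"]))  # -> 0.5
```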

2.2 Random Forest and Multilayer Perceptron

Random Forest (RF) [3] is a general method for building an ensemble of L tree-based classifiers. The data set of each classifier includes a subset of variables. The number of trees in the forest and the size of the variable subsets must be set a priori. The number of variables per subset is calculated as \( F = \log_{2} M + 1 \), where M is the number of attributes in the original data set. Each tree is grown to its maximum depth and no pruning is applied afterwards. The predicted class for a given example is determined by aggregating the predictions of the set of decision trees through a majority vote.
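As a worked example of the formula above (assuming the customary rounding down for non-integer results), a data set with M = 16 attributes yields F = 5 variables per subset:

```python
import math

def forest_ri_features(M: int) -> int:
    """Number of variables per subset: F = log2(M) + 1, rounded down."""
    return int(math.log2(M)) + 1

print(forest_ri_features(16))  # -> 5
```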

On the other hand, artificial neural networks (ANNs) are mathematical tools for modeling problems; they reveal functional relationships in data for classification, pattern recognition, regression, and similar tasks. The ANN applied here is the Multilayer Perceptron (MLP) [11], one of the most popular ANN models. For training, MLP uses the backpropagation (BP) learning algorithm to adapt its computed function to the needs of each particular problem.

2.3 Problem Formulation

Usually, a decision system \( DS = (U, A \cup \{d\}) \) is seen as a single set, where A is the set of predictive features and d denotes the decision feature [1, 6, 8, 14]. However, in a DS it is possible to define different contexts (subsets of features \( A_i \subset A \)) that bear a certain relationship with d. Such contexts reveal distinct viewpoints on the relationships between the predictive and decision attributes. Several decision subsystems \( DS_i = (U, A_i \cup \{d\}) \) can be obtained by using different contexts \( A_i \). Those contexts can emerge from the set of predictive features in a natural way. Consider, for instance, a DS with information on college students, where A includes a few features on their social status, others about high school grades, others regarding the entrance examination, and so forth. Each of those sets of features offers a unique outlook on the student.
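A small sketch of the idea, with hypothetical column names standing in for the student features mentioned above; each decision subsystem \( DS_i \) is simply a column subset paired with the decision feature d:

```python
import pandas as pd

ds = pd.DataFrame({
    "parent_income": [1, 2, 2],                 # social-status context
    "urban_home":    [0, 1, 1],
    "hs_math":       [4, 3, 5],                 # high-school grades context
    "hs_language":   [5, 3, 4],
    "exam_score":    [88, 70, 95],              # entrance-examination context
    "d":             ["pass", "fail", "pass"],  # decision feature
})

contexts = {
    "social": ["parent_income", "urban_home"],
    "grades": ["hs_math", "hs_language"],
    "exam":   ["exam_score"],
}

# DS_i = (U, A_i ∪ {d}): one column subset per context.
subsystems = {name: ds[cols + ["d"]] for name, cols in contexts.items()}
```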

In real-world scenarios, it is not easy to create proper contexts \( A_i \) from the predictive features. It is therefore necessary to tackle the problem of creating suitable contexts on which to apply machine learning methods. That is precisely the problem we address in this paper, by introducing a method to build contexts to be used in classification tasks.

3 Proposed Method

Genetic Algorithm and Rough Sets-based Multi-Granulation (GA-RS-MG) is the method that we propose to carry out a multi-granulation from the feature viewpoint, in order to develop context-based machine learning methods. Our method relies on a GA to automatically determine the contexts of each DS. It uses a generational model with elitist replacement: the fittest individual of the previous population survives into the current one. In this specific case, chromosomes (individuals) have as many genes as there are predictive features in the DS. Chromosomes have a binary representation, where value 1 indicates that the corresponding feature is selected, and 0 that it is not.
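The representation can be sketched as follows (hypothetical helpers, not the authors' code): a chromosome is a list of 0/1 genes, one per predictive feature, and its 1-genes select the features of the context it encodes:

```python
import random

def init_population(pop_size: int, n_features: int, seed: int = 0):
    """Each gene is set to 1 or 0 with probability 0.5."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size)]

def context_of(chromosome, features):
    """Decode a chromosome into the feature subset (context) it selects."""
    return [f for f, gene in zip(features, chromosome) if gene == 1]
```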

We used the measure of dependence described by Eq. (3) as the fitness function, where each chromosome represents a context. For each decision class \( D_i \), \( B_{*}(D_i) \) is calculated by Eq. (1). The overall sum of the numbers of objects in the lower approximations is divided by the cardinality of the universe U, which in this case is the number of instances in the DS. In this way, we evaluate the degree of dependence of each chromosome with respect to the decision feature. From the last GA population, among the individuals with the highest dependence degrees, the most mutually different ones are selected.

The pseudocode in Table 1 describes the algorithmic basis of GA-RS-MG. Line 1 initializes the population: every gene is set to 1 or 0 with a probability of 0.5. Then, the fitness of each individual is evaluated. A maximum of 100 iterations is used as the stop condition for the evolutionary loop (line 2), and the population size is s = 44. We set the crossover probability to pc = 0.7 and the mutation probability to pm = 0.09. The values of all the aforementioned parameters were chosen empirically.
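A compact skeleton of the evolutionary loop with the stated parameters; since Table 1 is not reproduced here, the selection and one-point crossover operators below are illustrative assumptions, and fitness stands for any function mapping a chromosome to σ(context, d):

```python
import random

def evolve(n_features, fitness, s=44, pc=0.7, pm=0.09, max_iters=100, seed=0):
    """Generational GA with elitist replacement, as described in the text."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(s)]
    for _ in range(max_iters):
        ranked = sorted(pop, key=fitness, reverse=True)
        elite = ranked[0]                            # fittest individual survives
        children = []
        while len(children) < s:
            a, b = rng.sample(ranked[: s // 2], 2)   # assumed truncation selection
            child = a[:]
            if rng.random() < pc:                    # one-point crossover
                cut = rng.randrange(1, n_features)
                child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < pm else g for g in child]  # mutation
            children.append(child)
        worst = min(range(s), key=lambda i: fitness(children[i]))
        children[worst] = elite                      # elitist replacement (line 16)
        pop = children
    return pop
```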

Table 1. Genetic Algorithm and Rough Sets-based Multi-Granulation (GA-RS-MG).

In line 16, the best current individual replaces the worst descendant. We determine the number of contexts (line 19) according to a uniform distribution between a minimum of 3 individuals and a maximum of one third of the population size (s/3 individuals). We select the best individuals by their degree of dependence (line 20), preferring those that are more dependent but also more different from each other. Two individuals (chromosomes) are considered distinct if they differ at any i-th position (gene). We aim to obtain the best contexts at once, while ensuring that they differ with regard to their predictive features. Finally, the selected best individuals become the output set of contexts.
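A possible greedy reading of lines 19-20 (the exact trade-off between dependence and diversity is not detailed in the text, so this sketch is an assumption): draw the number of contexts uniformly from [3, s/3], then scan the individuals by decreasing fitness, keeping a candidate only if it differs from every context kept so far:

```python
import random

def hamming(a, b):
    """Number of gene positions at which two chromosomes differ."""
    return sum(x != y for x, y in zip(a, b))

def pick_contexts(population, fitness, s=44, seed=0):
    rng = random.Random(seed)
    k = rng.randint(3, s // 3)                      # line 19: number of contexts
    ranked = sorted(population, key=fitness, reverse=True)
    chosen = [ranked[0]]
    for cand in ranked[1:]:
        if len(chosen) == k:
            break
        if all(hamming(cand, c) > 0 for c in chosen):  # differ in >= 1 gene
            chosen.append(cand)
    return chosen
```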

4 Results and Discussion

For the experiments, we used data sets (see Table 2) from the University of California at Irvine (UCI) repository. The degree of dependence between the contexts produced by GA-RS-MG and the decision feature satisfies the condition σ(Ai, d) ≥ 0.75. The contexts found for each single DS do not have the same number of attributes. We assessed the suitability of such contexts by applying MLP to discover knowledge from them. For MLP and RF we used the default parameter setups of the WEKA data-mining tool, version 3.8.

Table 2. Benchmark decision systems: description and generated contexts.

To ensure statistical robustness, we performed a 10-fold cross-validation with one run per DS. Besides, the model built from each original DS was added to a voting algorithm along with the models of the corresponding contexts produced by GA-RS-MG. Such an ensemble algorithm has MLP as its base classifier. Table 3 shows the results (for the weighted precision (WP) and mean absolute error (MAE) evaluation measures) achieved by the proposed ensemble method with MLP as base classifier (VoteMLP), as well as by MLP and RF. Highlighted values are the overall best results for each DS. VoteMLP exhibits the best results regarding WP.
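The experiments were run in WEKA; as a rough scikit-learn analogue (the data set, contexts and parameters below are illustrative, not the paper's setup), one can obtain 10-fold out-of-fold probabilities for one MLP per context plus one on the full DS, and combine them by an average probability vote:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
contexts = [[0, 1], [2, 3], [0, 1, 2, 3]]  # two toy contexts plus the full DS

# Out-of-fold class probabilities for one MLP per context (10-fold CV).
probas = [
    cross_val_predict(MLPClassifier(max_iter=1000, random_state=0),
                      X[:, cols], y, cv=10, method="predict_proba")
    for cols in contexts
]
pred = np.mean(probas, axis=0).argmax(axis=1)  # average probability vote
print("accuracy:", (pred == y).mean())
```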

Table 3. Experimental results.

Figure 1 depicts the mean and the standard deviation of the classification algorithms, concerning both WP and MAE. VoteMLP and MLP reached the best performance. To detect significant differences within the group of methods, we applied the Friedman test to the results for WP and MAE. Table 4 shows the average ranking of each method. The p-values computed by the Friedman test were 0.2636 and 4.1770E−05 for WP and MAE, respectively. Accordingly, there are significant differences among the three algorithms regarding MAE, for a significance level α = 0.05. Consequently, in a post-hoc stage, we applied the Holm test in order to detect significant differences between pairs of algorithms, as recommended in [7].

Fig. 1. Box plot of evaluation measures.

Table 4. Friedman test and Holm test.

Table 4 shows the adjusted p-values computed by the Holm test for each pair of methods in the comparison hypotheses, for a significance level α = 0.05. Regarding weighted precision, MLP is outperformed by RF, which is in turn outperformed by VoteMLP; however, none of these superiorities is statistically significant. In contrast, in terms of mean absolute error, MLP significantly outperforms both VoteMLP and RF, and the poorest results belong to RF, which is also significantly outperformed by VoteMLP. Considering all the above, MLP and VoteMLP exhibit the best performance.

5 Conclusions

We have proposed Genetic Algorithm and Rough Sets-based Multi-Granulation, which creates granules (contexts) by means of a GA. Each context must fulfill a certain degree of dependence with respect to the decision feature. Moreover, the obtained models are simpler and, as a whole, more precise. We consider the models of the contexts and of the original DS for classification by an ensemble. The proposed method shows suitable results, statistically assessed by comparing the performance of the MLP, VoteMLP and RF classifiers.

VoteMLP was superior to both MLP and RF in terms of weighted precision, and RF outperformed MLP, though in neither case significantly. Regarding the mean absolute error, VoteMLP significantly outperforms RF, and MLP outperforms them both. The results of the proposed approach are comparable with those of RF. Even when RF performs better, our method has an extra advantage: RF uses 100 trees to find a solution, whereas our method builds only as many base models as there are constructed contexts (see Table 2). That is a good reason to assess not only its effectiveness but also its efficiency.