1 Introduction

Subgroup Discovery (SD) [1, 9, 16] is a pattern recognition task that identifies descriptions of subsets of a dataset that show a different behavior with respect to certain interestingness criteria. SD searches for local patterns, generally in the form of rules, where the body contains constraints applied to the data and the head represents the best supported class. Different approaches have been presented to discover subgroup sets [3, 8].

Most of these papers evaluate their proposals with an experimental study that only uses the estimation of some quality measures through a 10-fold cross-validation. To summarize the results, the average over all subgroups mined in each fold is computed, and then the average of the results obtained over all partitions is calculated. Notice that the average is highly sensitive to extreme values, and it might hide important differences between miners that return the same averaged value. Common SD quality measures are: unusualness or weighted relative accuracy, sensitivity or recall, and confidence or precision [5].
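As a minimal sketch of this traditional aggregation (the fold values and the measure below are purely illustrative, not taken from any of the cited studies):

```python
import numpy as np

# Hypothetical per-fold results: each inner list holds the values of one
# quality measure (e.g. unusualness) for the subgroups mined in that fold.
fold_results = [
    [0.12, 0.08, 0.15],   # fold 1
    [0.10, 0.09],         # fold 2
    [0.11, 0.14, 0.07],   # fold 3 (remaining folds omitted)
]

# Traditional methodology: average over the subgroups of each fold,
# then average the per-fold means over all partitions.
fold_means = [np.mean(fold) for fold in fold_results]
overall_score = np.mean(fold_means)
print(overall_score)
```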

In the typical comparison methodology, subgroup redundancy is ignored, even though it introduces errors in the computation of the averaged metrics. Subgroups are represented as patterns, and redundant ones may present similar quality when they cover overlapping sets of instances in a dataset. Moreover, this methodology considers neither the individual quality of the subgroups mined nor the similarity between the sets obtained by the different algorithms.

This paper proposes a new method to evaluate and compare SD algorithms taking into account the redundancy, quality and similarity of the subgroup sets they obtain. To do so, we first apply a novel algorithm to remove redundant subgroups, which is based on the examples covered by the patterns and the statistical redundancy between them. Then, the proposed similarity and quality procedures can be applied over the subgroups obtained in the previous step. The quality evaluation procedure considers the quality distribution of the subgroup set, while the similarity approach allows the user to select the algorithm that provides more distinct information about the dataset. Finally, different graphics for each method are presented to improve the comprehensibility of the results.

2 Subgroup Discovery: Redundancy and Comparison

The main objective of the SD task is to identify interesting groups of individuals, where interestingness is defined as distributional unusualness with respect to a certain property of interest. Most SD algorithms provide a set of the best qualified subgroups, whose quality is defined as the mean value of the measures obtained by all the subgroups mined [6]. Thus, the typical comparison methodology summarizes the results using the average, and the subgroup set with the highest values is then selected as the best one [1, 7, 8].

This comparison approach has some drawbacks related to the nature of the average. It is well known that the average is affected by outliers and by data that do not follow a central distribution. For instance, let us consider two subgroup sets A and B, where almost all subgroups in A have high quality except for a few poor ones, and most of the subgroups in B present lower quality than the best ones from A. Because of the poor outliers, the average of the quality measure obtained by B can be higher than the one obtained by A. Under the traditional methodology, B would then be considered better than A, even though A presents more subgroups of better quality than B. The distribution of the subgroups over the domain of the quality measure is therefore also important when comparing subgroup sets.
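A small numeric illustration of this effect (the quality values below are invented for the example):

```python
import numpy as np

# Hypothetical quality values (e.g. unusualness) of two subgroup sets.
A = [0.90, 0.88, 0.86, 0.85, 0.05, 0.04]   # mostly high quality, two poor outliers
B = [0.62, 0.61, 0.60, 0.60, 0.59, 0.58]   # uniformly moderate quality

print(np.mean(A))  # ~0.597
print(np.mean(B))  # 0.600 -> B "wins" on average despite having no high-quality subgroup
```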

Another major problem, which affects not only the computation of the average measures but also the comprehensibility of the results, is redundancy. Dependencies between the non-target attributes lead to large numbers of variations of a particular subgroup. Since many descriptions can have a similar coverage of the given data, this may lead to many redundant evaluations of the quality function and to a subgroup set that contains multiple descriptions of the same subpopulation. Moreover, the average of the measures of a subgroup set with a high redundancy level can be affected, as shown in Fig. 1. Hence, redundancy is an important factor to take into account when comparing subgroup sets [2, 10].

Fig. 1. Average of the measures computed with and without redundant subgroups for different datasets.

Redundant subgroups are those that cover a subset, or a similar set, of the data records covered by some other subgroup [10]. Several approaches have been presented to detect and remove redundancy. Li et al. [12] propose an interesting approach to detect and prune redundant subgroups using a heuristic search and the error bounds of the OddsRatio measure [11]. In [2], a closure system is used to represent a subgroup by its coverage of a dataset. Van Leeuwen et al. [10] propose several selection strategies to eliminate redundancy in heuristic search algorithms. In general, these proposals employ one of two search spaces for redundancy detection: the description space or the coverage space. The former is more efficient but less precise than the latter.

3 A New Method to Evaluate Subgroups Discovery Algorithms

In this section, we present a new method to evaluate and compare SD algorithms, analyzing the redundancy, quality and similarity of the subgroups obtained by different approaches. First, this method removes redundant subgroups using a novel procedure based on the examples covered by the patterns and the statistical redundancy between them. Then, the quality and similarity procedures can be applied over the subgroups obtained in the first step. Each procedure is presented in detail in the following.

Fig. 2. Graphics designed for the evaluation methods: (a) redundancy method, (b) similarity method, (c) quality method.

Evaluating and Removing Redundancy

We propose a new procedure to identify whether two patterns are redundant using the following properties: the ratio of examples covered by the two patterns and the statistical redundancy between them. We use the covered example ratio presented in [13], which represents the maximum percentage of examples covered by both patterns with respect to the examples covered by each pattern. If this ratio is higher than a threshold value \(CovRat_{min}\), then these patterns would appear to provide similar information about the search space. However, such patterns could be describing different class distributions that may be interesting for the users. For this reason, the statistical redundancy proposed in [12] is also computed, which is based on the confidence intervals of the OddsRatio. If the confidence intervals of the OddsRatio of the two patterns overlap, then they are considered redundant.

The ratio of examples covered takes values in the range [0, 1], where values close to 0 indicate that the rules share few covered examples and values close to 1 indicate that the rules cover almost the same examples. Notice that the \(CovRat_{min}\) threshold allows the user to determine the allowed overlap degree between the compared subgroups. The ratio is defined as \(CovRat(P_1,P_2) = \max \left[ \tfrac{cov(P_1\,\wedge \,P_2)}{cov(P_1)} ,\tfrac{cov(P_1\,\wedge \,P_2)}{cov(P_2)}\right] \), where \(cov(P_1 \wedge P_2)\) represents the number of examples covered by both subgroups \(P_1\) and \(P_2\), and \(cov(P_1)\) and \(cov(P_2)\) represent the number of examples covered by \(P_1\) and \(P_2\), respectively.
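The coverage ratio can be computed directly from the sets of instance indices covered by each pattern. The sketch below assumes covered examples are represented as Python sets; the function name is ours, not from the original proposal.

```python
def cov_rat(covered_1, covered_2):
    """Coverage ratio of two patterns, given the sets of instance
    indices covered by each pattern."""
    if not covered_1 or not covered_2:
        return 0.0
    common = len(covered_1 & covered_2)        # cov(P1 ^ P2)
    return max(common / len(covered_1), common / len(covered_2))

# Example: two patterns sharing two of their covered instances.
print(cov_rat({1, 2, 3, 4}, {3, 4, 5}))       # max(2/4, 2/3) = 0.67
```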

The OddsRatio of a subgroup P is defined as \(OR(P) = \tfrac{TP\,*\,TN}{FP\,*\,FN}\), where TP, FP, FN and TN are the terms of the contingency table. The confidence interval of the OddsRatio is calculated as \(\left[ OR(P)e^{-w},OR(P)e^w\right] \), where \(w = z_{\alpha /2}*\sqrt{\tfrac{1}{TP}+\tfrac{1}{FP}+\tfrac{1}{FN}+\tfrac{1}{TN}}\). The critical value for a 95% confidence interval is \(z_{\alpha /2}=1.96\).
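A sketch of the complete redundancy check, combining the coverage ratio above with the OddsRatio interval overlap. The pattern fields (covered, contingency) are hypothetical names introduced for illustration, and non-zero contingency counts are assumed to avoid division by zero.

```python
import math

Z_95 = 1.96  # critical value for a 95% confidence interval

def odds_ratio_ci(tp, fp, fn, tn):
    """Confidence interval of the OddsRatio from the contingency table terms."""
    or_value = (tp * tn) / (fp * fn)
    w = Z_95 * math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return or_value * math.exp(-w), or_value * math.exp(w)

def are_redundant(p1, p2, cov_rat_min=0.75):
    """Two patterns are flagged as redundant when their coverage ratio exceeds
    the threshold and their OddsRatio confidence intervals overlap."""
    if cov_rat(p1.covered, p2.covered) <= cov_rat_min:
        return False
    lo1, hi1 = odds_ratio_ci(*p1.contingency)
    lo2, hi2 = odds_ratio_ci(*p2.contingency)
    return lo1 <= hi2 and lo2 <= hi1           # intervals overlap
```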

Notice that algorithms producing a large percentage of redundant subgroups are less efficient, and the mined subgroup set carries information that can potentially decrease the user's ability to understand the results. To analyze the redundancy detected by this method, we propose using a bar chart that shows the percentage of redundant subgroups obtained by each algorithm on each dataset, as can be seen in Fig. 2(a).

Similarity Between Mined Subgroup Sets

The similarity of subgroup sets can be defined by the number of common or similar subgroups between the sets obtained by the algorithms analyzed, where two patterns are common when they are redundant. To compute it, all patterns mined by both algorithms are added to a pool. Then, the similar subgroups are identified using the method presented above. In this way, we obtain the set of common patterns and the set of subgroups found only by each of the algorithms analyzed. Notice that only subgroup sets obtained on the same dataset partition are compared with this method.
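A minimal sketch of this pooling step, reusing the are_redundant check from the previous subsection (the pattern objects and function names are assumptions for illustration):

```python
def compare_sets(patterns_a, patterns_b, cov_rat_min=0.75):
    """Partition the pooled patterns of two algorithms into common patterns
    and the patterns found by only one of them."""
    only_a, only_b, common = list(patterns_a), list(patterns_b), []
    for pa in patterns_a:
        for pb in patterns_b:
            if are_redundant(pa, pb, cov_rat_min):
                common.append((pa, pb))
                if pa in only_a:
                    only_a.remove(pa)
                if pb in only_b:
                    only_b.remove(pb)
    return common, only_a, only_b
```

The proportions plotted in the stacked bar graphic described below follow directly from the sizes of the three returned collections.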

To better analyze these results, we propose a stacked bar graphic for each pair of algorithms over all the datasets, as can be seen in Fig. 2(b). This figure shows a similarity comparison between alg1 and alg2, where each bar represents the total number (100%) of patterns found by both algorithms on a dataset. The gray color represents the proportion of common subgroups, while black and white represent the subgroups obtained only by alg1 and alg2, respectively. We can see that alg1 and alg2 obtain very similar results on the db2 dataset, since the percentage of common patterns is large. Moreover, it can be seen that alg1 extracts more information from the db2 dataset than alg2, since it obtains all the common subgroups plus more distinct subgroups than alg2. Notice that this similarity analysis allows the user to select the algorithm that provides more distinct information about the dataset.

Comparing the Quality of the Mined Subgroups

The quality of a subgroup is defined by the values of different quality measures proposed in the literature, such as confidence, sensitivity and unusualness. For all of these measures, higher values are better. We can therefore divide the range of the observed values into a number of intervals \(N_{Interv}\) to characterize the quality of a subgroup by the interval it belongs to. \(N_{Interv}\) and the interval limits are determined by the user. In this work, we empirically set \(N_{Interv} = 3\) to identify the lowest, middle and highest quality intervals, and the range of values is divided into three equal parts to define the interval limits. The quality of a subgroup set can then be analyzed through the percentage of patterns belonging to each of these quality intervals, which allows us to consider the quality distribution of the subgroup set. Finally, the lower and upper bounds of the range of values are defined by the minimum and maximum value found over all the patterns obtained by the algorithms analyzed.
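The interval binning can be sketched as follows, assuming the global bounds are taken over the pooled patterns of the compared algorithms (function and variable names are ours):

```python
import numpy as np

def quality_distribution(values, pooled_values, n_interv=3):
    """Percentage of subgroups falling into each of n_interv equal-width
    quality intervals, whose bounds are the minimum and maximum quality
    observed over all compared algorithms."""
    edges = np.linspace(min(pooled_values), max(pooled_values), n_interv + 1)
    counts, _ = np.histogram(values, bins=edges)
    return 100.0 * counts / len(values)

# Example: three subgroups of one algorithm binned against pooled bounds.
print(quality_distribution([0.2, 0.5, 0.9], [0.1, 0.2, 0.5, 0.9, 1.0]))
```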

To represent the results of this method, we employ a composite graphic like the one in Fig. 2(c). The comparison is pairwise, so the representation employs two bar graphics that have to be interpreted as a whole, since the interval boundaries for both algorithms are determined by the combination of their results. Each subgraphic represents the results obtained by one algorithm, where each bar shows the percentage of patterns that belong to each quality interval for a dataset. The black, gray and white colors represent the lowest, middle and highest intervals of the measure domain, respectively. The higher the percentage of patterns in the upper interval, the better the subgroup set. Figure 2(c) shows that alg2 has more subgroups of high quality than alg1.

4 Experimental Validation

To validate the new evaluation method, we compare three well-known SD algorithms: SD-map [1], Apriori-SD [8] and NMEEF-SD [3]. We have considered the following 20 datasets from the UCI Repository of machine learning databases [4]: Appendicitis, Australian, Balance, Breast Cancer, Bridges, Bupa, Cleveland, Diabetes, Echo, German, Glass, Haberman, Heart, Hepatitis, Ionosphere, Iris, Led, Primary Tumor, Vehicle and Wine. The parameters of the analyzed algorithms are presented in Table 1. These parameters were selected following the recommendations of the authors. The Apriori-SD and SD-map implementations do not handle continuous variables, so an ID3 [15] discretization was applied. For the experiments, the parameters of our proposal were set to \(CovRat_{min} = 0.75\) and \(N_{Interv} = 3\). We consider the average results of a 10-fold cross-validation. In addition, since NMEEF-SD is stochastic, three runs are performed.

Table 1. Parameters of the algorithms.

In these experiments, we first apply the traditional methodology to evaluate the analyzed algorithms. Then, we use the new evaluation methods to show how they can improve the quality of the comparison by providing more information about it. In this study, we show a pairwise comparison between the algorithms considered. To apply the traditional methodology, we statistically compare the average of the quality measures obtained by the algorithms on all datasets. We analyze these results considering all the subgroups discovered and also the ones remaining after the redundant subgroups are removed. We have used a Wilcoxon test [14] with a significance level of 0.05; Table 2 shows the results obtained.
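For reference, this statistical comparison can be reproduced with SciPy's Wilcoxon signed-rank test; the per-dataset averages below are placeholders, not the actual experimental values.

```python
from scipy.stats import wilcoxon

# Placeholder per-dataset averages of one measure (e.g. unusualness)
# for two algorithms, aligned over the same datasets.
alg1_scores = [0.12, 0.08, 0.15, 0.11, 0.09, 0.13]
alg2_scores = [0.10, 0.09, 0.14, 0.13, 0.08, 0.12]

stat, p_value = wilcoxon(alg1_scores, alg2_scores)
print(p_value < 0.05)   # significant difference at alpha = 0.05
```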

Table 2. Wilcoxon test (\(\alpha =0.05\)) on the different measures for the SD algorithms, where • indicates that the algorithm in the row improves on the algorithm in the column and ◦ that the algorithm in the column improves on the algorithm in the row. The upper diagonal shows the test results after removing redundant subgroups; the lower diagonal shows the test results with all the subgroups obtained.

The redundancy analysis is also shown in Fig. 3, where the percentage of redundant subgroups found is reported for each dataset. The similarity between the studied algorithms is presented in Fig. 4. To consider the quality distribution of the mined subgroup sets using the different quality intervals, Fig. 5 shows the percentage of patterns of each algorithm that belong to the low, middle and high quality intervals.

Fig. 3. Percentage of redundant subgroups obtained by each algorithm for all the datasets.

Fig. 4. Percentage of common and individual patterns.

Fig. 5. Pairwise comparison of the average measure distributions of the subgroup sets obtained by all algorithms for each dataset.

We can draw the following conclusions from the analysis of the results of the pairwise comparison between the algorithms considered:

  • SD-map vs Apriori-SD: Figure 3 shows that Apriori-SD obtains fewer redundant subgroups than SD-map. The similarity analysis presented in Fig. 4(a) shows that Apriori-SD discovers more knowledge than SD-map, since it obtains more than 50% of the total number of subgroups mined by both methods in most of the datasets. The statistical analysis shows that there is no significant difference between the average results for the confidence and unusualness measures. However, Fig. 5 shows that more of the subgroups mined by Apriori-SD obtain better values for these measures than those obtained by SD-map. Therefore, Apriori-SD can be considered better than SD-map because it provides more diverse knowledge with better quality in most of the measures considered.

  • SD-map vs NMEEF-SD: The redundancy analysis shows that NMEEF-SD mines more redundant subgroups than SD-map. Table 2 shows how the statistical results between these two algorithms change when the redundant subgroups are removed. Moreover, if we analyze only the test results obtained once the redundant subgroups have been removed, we can see that the differences for the unusualness and sensitivity measures are not significant. However, most of the subgroups obtained by NMEEF-SD reach higher values for these measures than the ones mined by SD-map, as can be seen in Fig. 5. The confidence values obtained by SD-map are better than those obtained by NMEEF-SD, as the statistical results and Fig. 5 show. The similarity analysis shows that the subgroups obtained are very different, with few subgroups in common. In summary, both algorithms provide different knowledge to the users, and the subgroups mined by NMEEF-SD present better values for the unusualness and sensitivity measures, which are the most relevant in the SD task.

  • Apriori-SD vs NMEEF-SD: The statistical comparison between these algorithms does not change when the redundant subgroups are removed, with NMEEF-SD being better than Apriori-SD for the unusualness and sensitivity measures, as shown in Table 2. Figure 5 also shows that most of the subgroups mined by NMEEF-SD belong to the highest quality interval. These algorithms present low similarity between them. NMEEF-SD can be considered better than Apriori-SD because it provides more diverse knowledge with better quality in most of the measures considered.

5 Conclusions

In this paper, we propose a new method to evaluate and compare SD algorithms considering the redundancy, quality and similarity of the subgroup sets they obtain. First, this method removes redundant subgroups using a novel procedure based on the examples covered by the patterns and the statistical redundancy between them. Then, unlike previous research, which estimates the quality of the algorithms using the average of the quality measures obtained from a 10-fold cross-validation, we perform a paired comparison between subgroup sets obtained from the same partition of the dataset to determine the quality distribution of the mined subgroup sets. Moreover, the proposed method can also determine how similar one subgroup set is to another using a new procedure that extracts the common subgroups between the sets obtained by the analyzed algorithms, where two patterns are common when they are redundant. Finally, the experimental validation shows how our proposal and its associated graphics can provide more useful information to the users in order to select the best algorithm for their SD problems.