Open Access. Published by De Gruyter, February 4, 2016 (CC BY-NC-ND 3.0 license).

Reducing the Feature Space Using Constraint-Governed Association Rule Mining

Doreswamy and M. Umme Salma

Abstract

Recent advancements in science and technology and advances in the medical field have paved the way for the accumulation of huge amounts of medical data in digital repositories, where they are stored for future use. Mining medical data is a challenging task, as the data are subject to many social concerns and ethical issues. Moreover, medical data are often noisy and unreliable, as they contain many missing and misleading values and may sometimes be faulty. Pre-processing tasks in medical data mining are therefore of great importance, and the main focus is on feature selection, because the quality of the input determines the quality of the resulting data mining process. This paper provides insight into developing a feature selection process in which a data set subjected to constraint-governed association rule mining and interestingness measures yields a small feature subset capable of producing better classification results. In the experimental study, the feature subset was reduced by more than 50% through syntax-governed constraints and dimensionality-governed constraints, and this led to high-quality results. The approach yielded about 98% classification accuracy for the Breast Cancer Surveillance Consortium (BCSC) data set.

MSC 2010: 68U35

1 Introduction

Recent advancements in science and technology and advances in the medical field have paved the way for the accumulation of huge amounts of medical data. These data are stored in digital repositories and used for future endeavors. The availability of such a vast collection of medical data has stirred the interest of many data analysts, medical examiners, and scientists, who seek to make the best use of it to solve medical problems such as the identification of various diseases, security issues, policy improvements in healthcare, and many more. Even though issues related to security and policies are important in their own way, the most interesting and challenging issue is the identification of diseases in a smart way. Medical data mining (MDM) has become a promising scientific field in which the majority of researchers focus on finding better solutions for the identification, classification, and analysis of various diseases and their effects. Among these diseases, breast cancer is one of the leading causes of cancer death in women worldwide [20]. Scientists are doing their best to provide the best treatment for this disease, but it is a challenging task, as they are unable to fully understand the disease pattern and its change over time; hence, they seek the help of researchers and data analysts to analyze the data and find patterns that can help determine the type of breast cancer a person is suffering from, its characteristic features, and its stage, and thus recommend suitable medication and treatment.

Mining medical data is a challenging task, as the data are subject to many social concerns and ethical issues. Moreover, medical data are often noisy, containing many missing and misleading values, and may sometimes be faulty. It is therefore very important to handle medical data with great care: a single mistake may label a healthy person as unhealthy, or an unhealthy person as healthy, leading to complications and potentially deadly results. Any data mining task is done in three phases – pre-processing, processing, and post-processing. MDM is a field where much emphasis is placed on the pre-processing phase, where the data should be accurate, with no missing, misleading, or faulty values. Beyond this, it is widely accepted that the quality of the data determines the quality of the entire automated system designed for the analysis and detection of diseases. Pre-processing may involve many operations such as data cleaning, data transformation, data reduction, etc. This paper focuses mainly on data reduction, which involves the selection of strong, predominant attributes from the set of available attributes, i.e. attributes relevant enough to accomplish the classification task with greater accuracy. The paper is organized into five sections: Section 1 gives a brief introduction to medical data and the need for pre-processing, as mentioned above; Section 2 provides the background details required to understand the concept; Section 3 presents the proposed methodology; Section 4 gives the experimental details; and Section 5 deals with the conclusion and future work.

2 Background

2.1 Motivation

Two things motivated this work. The first is the alarming need for better diagnostic solutions for breast cancer, which, after cardiovascular disease, is the second leading cause of death in women. Worldwide, about 10 million people per year are diagnosed with cancer, more than 6 million die of the disease, and over 22 million people are living with cancer [20]. The second is the growing application of association rule mining (ARM) in various fields [12, 23]. The application of ARM in the field of MDM [5] raised our curiosity about applying it to the breast cancer problem. It is well known that accuracy is dominated by data quality; the cleaner and more precise the data, the better the results. Thus, we came up with the idea of finding a precise attribute subset using constraint-governed ARM, which can aid efficient data mining tasks.

2.2 Related Work

In data mining, feature selection is a crucial phase, as the result of the entire mining process rests on the feature set upon which the mining is performed. In MDM, selecting a strong feature subset is highly recommended when building automated classifiers because, as the number of features is reduced, the cost of building a computational model decreases too. When it comes to diagnosis and prognosis of diseases, the smaller the feature subset, the greater its importance, as it reduces diagnostic and testing costs.

Many researchers have proposed different ideas for selecting a feature subset: a subspace-based approach [9], a hybridized K-means clustering approach for high-dimensional data sets [4], feature selection using the Relief method [14], a hybrid feature selection approach using Relief and a genetic algorithm [30], feature selection for the classification of heart disease using the information gain method [13], and ontology-based feature selection of drug data using gain ratio [18]. The Gini index, one of the best single measures of inequality [6], was used as an approach for selecting relevant features in text data [21]. Feature selection has also been done through particle swarm optimization [24]. Rough set concepts have likewise been applied to feature selection; a rough set-based approach [3] and a rough feature selection algorithm with a multi-granulation view [15] were found to be beneficial in selecting relevant features from large data sets. A hybrid feature selection method based on simple association rules and artificial neural networks was used for the classification of erythemato-squamous diseases [11], where association rules were used for the classification of medical data for the first time. Whatever the approach, the aim remains the same, i.e. deriving precise and dominant features capable of performing the desired task with ease.

2.3 Aim and Challenges

2.3.1 Aim

The main aim of this research work is to select a minimal, strong, and expressive feature subset from breast cancer data using a constraint-governed ARM technique supported by various interestingness measures, which can later be used to obtain accurate classification and prediction results.

2.3.2 Challenges

Every technique comes with its own features and challenges. The main challenge of ARM is the selection of rules that serve the user's purpose. Even though the rules are formed in a probabilistic way depending upon Support and Confidence, these two measures alone are not enough to select the rules: low Support and low Confidence thresholds lead to the formation of many undesirable rules, whereas high Support and Confidence thresholds can hinder the generation of important rules.

2.3.3 Solution for the Challenges

Many researchers have made use of various quantitative measures like lift, coverage, leverage, and so on to address this challenge, but in the present study, other interestingness measures have been adopted along with constraints that can be applied to the rules to extract the desired ones. The working of this solution is explained in detail in Section 3, and brief information on the various interestingness measures that decide the quality of association rules is given in Section 2.4.

2.4 Preliminary View

A brief preliminary view of association rules, the constraints, and the interestingness measures governing them is provided in this section, so that the proposed methodology will be easy to understand.

2.4.1 Association Rules

Association rule learning, more popularly known as ARM, is a technique used to uncover associations among variables and to come up with interesting patterns or relations among them. It is intended to identify strong rules discovered in databases, using different measures of interestingness [22].

The general form of an association rule over an item set I is P → Q, where P ⊆ I, Q ⊆ I, and P ∩ Q = ∅; however, every individual rule must satisfy a minimum Confidence as well as a minimum Support threshold.

The definitions of Support and Confidence along with the mathematical representations are as follows.

In database D, the Support "s" of a rule P → Q is the percentage of transactions in which both P and Q occur together, which is nothing but the probability of the union of P and Q (i.e. P ∪ Q). It is mathematically represented by equation (1).

(1) s(P → Q) = p(P ∪ Q)

In database D, the Confidence "c" of a rule indicates the percentage of transactions containing P that also contain Q and is mathematically represented by equation (2).

(2) c(P → Q) = s(P ∪ Q) / s(P) = p(Q | P)

In Support, p denotes a probability, and in Confidence p denotes a conditional probability.
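To make the two measures concrete, the short sketch below computes Support and Confidence for a single candidate rule over a toy transaction list. The attribute-value strings are invented for illustration and do not come from the data sets used later in the paper.

```python
# Hypothetical illustration: Support and Confidence of a rule P -> Q over a
# small transaction list. Attribute values are invented for the example.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """s(P U Q) / s(P), i.e. the conditional probability p(Q | P)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

if __name__ == "__main__":
    transactions = [
        {"density=high", "menopause=yes", "cancer=yes"},
        {"density=high", "menopause=no", "cancer=yes"},
        {"density=low", "menopause=yes", "cancer=no"},
        {"density=high", "menopause=yes", "cancer=yes"},
    ]
    P, Q = {"density=high"}, {"cancer=yes"}
    print("support    =", support(P | Q, transactions))     # 0.75
    print("confidence =", confidence(P, Q, transactions))   # 1.0
```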

ARM is one of the strongest exploratory techniques, with applications in a wide range of fields, from business applications to human resource management and from natural language processing to text mining, and more recently in MDM. In MDM, ARM can be used for sequence analysis, link analysis, unique data analysis, etc. Its usage depends upon the data and the application.

2.5 Constraints for Association Rules

A constraint for an association rule is a user-specified restriction, intuition, or expectation used to restrict the generation of unwanted rules or to increase the probability of generating interesting rules. It is a smart means of compacting the database's search space. When one or more constraints are applied to the association rules, the process is called constraint-based ARM.

There are four types of constraints for mining association rules, and they are as follows:

  1. Knowledge type constraints: These constraints decide what type of knowledge is required by the user, i.e. whether the user's focus is on association or on correlation.

  2. Data constraints: These are the constraints which decide what type of data is required for the relevant tasks.

  3. Interestingness constraints: These constraints decide which interestingness measures should be used as thresholds. Usually, statistical measures like coverage, lift, leverage, etc. are given as constraints; however, the most commonly used are Support and Confidence.

  4. Rule constraints: These are the constraints which decide the form of rules to be mined. These constraints are usually represented in the form of templates (syntactical forms) which decide the structure of antecedents and consequents [10].

2.6 Interestingness Measures

Interestingness measures are measures intended to select and rank patterns according to their potential interest to the user [8]. There are three types of interestingness measures, and they are as follows:

  1. Objective measures: These measures are based on the structure of the discovered patterns and the statistics underlying them. These are based purely on raw data. For example, support, generality, conciseness, etc. [10].

  2. Subjective measures: These measures are based on the user's beliefs and disbeliefs about the data. Thorough background knowledge is required for a subjective measure. For example, surprisingness, novelty, etc. [10].

  3. Semantic measures: These measures are based on the semantics and explanations of the patterns and involve domain knowledge from the user [28].

3 Proposed Methodology

It is obvious that we get a large set of rules when data are subjected to association or correlation analysis, but not all rules are useful and/or applicable to serve the user's needs. Our main aim here is to select the set of rules strong enough to accomplish the task of feature selection. A strategy is proposed that helps the user identify a small set of concise, reliable, and actionable rules with large coverage along with acceptable Support and Confidence. The attributes selected from such rules are strong enough to perform classification tasks considerably better.

Our proposed methodology works in the following steps:

  1. Finding frequent item sets using the Apriori algorithm

  2. Rule generation

  3. Pruning the rules by restricting them to satisfy some constraints

  4. Selecting the top N attributes from the reduced rules based upon various interestingness measures

  5. Validating the selected feature subset

Step 1: Finding frequent item sets using the Apriori algorithm

The primary step in ARM is determining frequent item sets. Frequent item sets help in producing strong rules capable of identifying general trends, interesting relations, or unique patterns in a database. The results to be determined depend upon the type of database and the purpose for which it is to be mined; hence, generating the rules is subjective in nature. Various algorithms can be used to find all possible frequent item sets in a given database, and one such algorithm is the Apriori algorithm. Apriori is an iterative algorithm that uses prior (a priori) knowledge of frequent item set properties to generate frequent item sets; hence its name.

The Apriori algorithm uses a level-wise search to determine frequent item sets. In the first level, a count of each candidate item is made, and the candidates are labeled C1; then the set of individual frequent items that satisfy the minimum Support (say L1) is generated from the candidate set C1. Next, with the help of L1, a candidate set consisting of two items per transaction (say C2) is formed, and again, depending upon the minimum Support, the set of frequent 2-item sets (say L2) is generated; in the same way, C3 is generated, which in turn gives the set of frequent 3-item sets (say L3), and this process can be carried out up to N levels. Finding each Lk requires one full scan of the entire database. Even though the search space is large and the power set grows exponentially, the search can be handled efficiently and can be pruned by making use of the Apriori property, which states that "all non-empty subsets of a frequent item set must also be frequent" [10]. This property belongs to the category of anti-monotone properties of a set, also referred to as the downward closure property. The pseudocode for the Apriori algorithm introduced in [1] is given in Algorithm 1.

Step 2: Rule Generation

Given a database D containing a set of transactions, where each transaction T is a set of items drawn from I = {I1, I2, …, Im} (i.e. T ⊆ I), frequent item sets are obtained using the Apriori algorithm. After obtaining the frequent item sets, they are directly used to derive strong association rules based upon conditional probability. We know that the general form of an association rule is P → Q, where P ⊆ I, Q ⊆ I, and P ∩ Q = ∅. Depending upon the user-specified Support and Confidence, a number N of rules is generated, but the biggest challenge lies in selecting the interesting rules; this is considered the most important and promising task, as not all generated rules are interesting or important to the user.

Relying only on the Support and Confidence measures may not suffice: setting the Support threshold too low leads to the generation of many uninteresting rules, while setting the Support and Confidence thresholds too high causes many interesting rules to be lost. The solution to this problem is described in the next step.

Algorithm 1:

Apriori algorithm pseudocode.

procedure Apriori(T, minSupport)          ▷ T is the transaction database, minSupport the minimum Support
  L1 ← frequent 1-item sets in T
  for (k = 2; Lk−1 ≠ ∅; k++) do
    Ck ← candidates generated from Lk−1   ▷ join Lk−1 with itself and discard any candidate with an infrequent (k−1)-subset
    for each transaction t in T do
      increment the count of every candidate in Ck that is contained in t
    end for
    Lk ← candidates in Ck whose Support ≥ minSupport
  end for
  return ⋃k Lk
end procedure
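As a companion to the pseudocode, the following is a minimal, self-contained Python sketch of the same level-wise search followed by naive rule generation. It is written for readability rather than efficiency and is not the Orange implementation used in the experiments; the transaction format (sets of item strings) is an assumption made for the example.

```python
# Minimal Apriori sketch: level-wise frequent item set mining followed by
# naive rule generation. Illustrative only; the experiments in this paper use
# Orange's Associate module rather than this code.
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent item sets."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}     # C1 candidates
    frequent = {}
    k = 1
    while current:
        # One scan of the database counts all candidates of size k
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Candidate generation: join Lk with itself, prune by the Apriori property
        k += 1
        prev = list(level)
        current = {a | b for a in prev for b in prev
                   if len(a | b) == k
                   and all(frozenset(s) in level for s in combinations(a | b, k - 1))}
    return frequent

def generate_rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for all strong rules."""
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[antecedent]   # subsets of a frequent set are frequent
                if conf >= min_confidence:
                    yield antecedent, itemset - antecedent, conf
```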

Step 3: Pruning the rules by restricting them to satisfy some constraints

We have come up with two solutions in order to overcome the problem of selection of strong and interesting rules.

  1. Applying various constraints and

  2. Applying various other interestingness measures

The first solution is called constraint-governed ARM, where we apply certain constraints to the rules either in the form of queries or through a user interface. Here, the generated rules are subjected to dimensionality constraints and syntax-based constraints so that interesting rules are deduced from the large set of generated rules.

A dimensionality constraint restricts the dimensionality of a rule, i.e. it decides how many attributes may appear on the antecedent and consequent sides.

A syntax-based constraint retains only those rules that satisfy a particular syntax defined by the user. It is also called metarule-guided mining, as it imposes a rule on a rule. A template for a metarule constraint is given by equation (3).

(3) P1 ∧ P2 ∧ … ∧ Pk → Q1 ∧ Q2 ∧ … ∧ Qm

where P1, P2, …, Pk are the antecedents and Q1, Q2, …, Qm are the consequents. In our work, we consider syntax-based constraints in which the antecedent is restricted to always being the class variable; hence, equation (3) is reduced to equation (4).

(4) P1 → Q1 ∧ Q2 ∧ … ∧ Qm

where P1 is either benign or malignant.

By applying the syntax-based constraint, the number of generated rules is reduced from k to l (i.e. k > l and k, l > 0). The obtained l rules are further subjected to the dimensionality constraint, where we focus on the consequent part; here, the dimension of the consequent is restricted to 1. Thus, equation (4) is further reduced to equation (5).

(5) Pi → Qj
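To illustrate this pruning step, the sketch below filters a list of generated rules with the two constraints described above: a syntax constraint requiring the antecedent to be exactly the class variable, and a dimensionality constraint requiring a single-attribute consequent. The rule representation (pairs of attribute-name sets) and the class attribute name are placeholders, not the structures produced by Orange.

```python
# Hypothetical pruning of rules with syntax and dimensionality constraints.
# Rules are (antecedent, consequent) pairs of attribute-name sets; "class" is
# a placeholder for the class variable (e.g. benign/malignant).

def syntax_constraint(rule, class_attr="class"):
    """Keep only rules whose antecedent is exactly the class variable."""
    antecedent, _ = rule
    return antecedent == {class_attr}

def dimensionality_constraint(rule, max_consequent_size=1):
    """Keep only rules whose consequent contains at most one attribute."""
    _, consequent = rule
    return len(consequent) <= max_consequent_size

def prune(rules):
    """Apply the syntax constraint first (k rules -> l rules), then dimensionality."""
    syntactic = [r for r in rules if syntax_constraint(r)]
    return [r for r in syntactic if dimensionality_constraint(r)]

if __name__ == "__main__":
    rules = [
        ({"class"}, {"density"}),               # kept by both constraints
        ({"class"}, {"density", "menopause"}),  # dropped by the dimensionality constraint
        ({"density"}, {"class"}),               # dropped by the syntax constraint
    ]
    print(prune(rules))   # [({'class'}, {'density'})]
```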

After obtaining a reduced set of constraint-based association rules, our intention is to select the top N attributes depending upon various interestingness measures as mentioned in the next step.

Step 4: Selecting top N attributes based on interestingness measures

Here we have chosen the generality, reliability, leverage, conciseness, and utility measures for selecting the attributes.

Generality: Generality, also termed coverage "cv", is the amount of data covered by the rule P → Q and is given by equation (6).

(6) cv(P → Q) = s(P) = p(P)

Since it is nothing but the probability of the antecedent P, it is also referred to as antecedent Support.

Leverage: Leverage "l" measures the difference between how often P and Q appear together in the data set and how often they would be expected to appear together if P and Q were statistically independent (Piatetsky-Shapiro's leverage). The higher the leverage, the more often P and Q appear together than would be expected by chance. Leverage is given by equation (7).

(7) l(P → Q) = p(P and Q) − p(P) × p(Q)

Reliability: A pattern or rule is said to be reliable if it is more applicable. The applicability of a pattern is measured by its Confidence: the higher the Confidence, the higher the reliability, and vice versa.

Conciseness: A pattern is said to be concise if it contains only a few attribute-value pairs. Thus, we restrict the generated rules by applying the syntax and dimensionality constraints so that they become concise.

Utility: A pattern is said to have utility if it serves the intended purpose. The patterns obtained by applying the dimensionality constraint serve with high utility in selecting strong features.
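As a small worked illustration of the quantitative measures above, the sketch below computes coverage (equation 6) and leverage (equation 7) for a pruned rule directly from transactions. The data and rule structures are invented for the example.

```python
# Illustrative computation of coverage (generality) and leverage for a rule
# of the form P -> Q, following equations (6) and (7). Data are invented.

def prob(itemset, transactions):
    """Empirical probability that a transaction contains every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def coverage(antecedent, transactions):
    """cv(P -> Q) = s(P) = p(P), the antecedent Support."""
    return prob(antecedent, transactions)

def leverage(antecedent, consequent, transactions):
    """l(P -> Q) = p(P and Q) - p(P) * p(Q)."""
    return (prob(antecedent | consequent, transactions)
            - prob(antecedent, transactions) * prob(consequent, transactions))

if __name__ == "__main__":
    transactions = [
        {"class=malignant", "density=high"},
        {"class=malignant", "density=high"},
        {"class=benign", "density=low"},
        {"class=benign", "density=high"},
    ]
    P, Q = {"class=malignant"}, {"density=high"}
    print("coverage =", coverage(P, transactions))      # 0.5
    print("leverage =", leverage(P, Q, transactions))   # 0.5 - 0.5 * 0.75 = 0.125
```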

Step 5: Validating the selected feature subset

A comparative study is carried out by comparing the proposed method with other widely used methods. In order to validate its performance, the feature subset obtained from the proposed method and the attribute subsets obtained from the other methods are fed to a support vector machine (SVM), where we check the classification accuracy obtained with the selected attributes. The results obtained are interesting and are tabulated in Section 4.2.
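As an illustration of this validation step, the following is a minimal sketch that compares candidate feature subsets using an SVM with 10-fold cross validation. It uses scikit-learn's SVC purely for illustration (the actual experiments are carried out in Orange, as described in Section 4); the CSV file name, column names, and feature subsets are placeholders.

```python
# Hypothetical validation sketch: comparing the classification accuracy of two
# candidate feature subsets with an SVM and 10-fold cross validation.
# The data file, column names, and subsets are placeholders; the paper itself
# uses Orange's SVM widget rather than scikit-learn.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

data = pd.read_csv("breast_cancer.csv")      # placeholder data file
y = data["cancer"]                           # placeholder class column

subsets = {
    "proposed":         ["density", "age_group", "menopause"],  # placeholder subset
    "information_gain": ["density", "age_group", "race"],       # placeholder subset
}

for name, columns in subsets.items():
    X = pd.get_dummies(data[columns])        # one-hot encode categorical attributes
    for kernel in ("linear", "poly", "rbf"):
        scores = cross_val_score(SVC(kernel=kernel), X, y, cv=10)
        print(f"{name:16s} {kernel:6s} accuracy = {scores.mean():.3f}")
```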

The entire procedure of the proposed methodology is represented in the form of a flow chart in Figure 1.

Figure 1: Flow Chart of the Proposed Model to Reduce the Feature Space Using Constraint-governed Association Rule Mining.

4 Experimental Study

The main aim of the experiment is to select the top N attributes from the attributes of a given database using constraint-governed ARM. The experiment is carried out with the open source software tool Orange [19], a component-based data mining tool for the analysis and visualization of data, used by both experts and novice users [19]. In Orange, the Associate module has been used to generate association rules. It consists of the following five widgets:

  1. Frequent item widget, which generates frequent items

  2. Frequent item explorer widget, which displays the frequent item sets

  3. Association rule widget, which generates association rules

  4. Association rule filter widget, which provides a user interface to apply different constraints to the generated rules and thereby provides a filtering mechanism

  5. Association rule explorer widget, which displays the obtained association rules

The feature subset obtained is validated for its usefulness by subjecting it to an SVM, which is carried out using the SVM widget with three different kernel functions, namely, the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel [19].

The experiment is carried out on three categorical benchmark breast cancer data sets, namely, the Breast Cancer Surveillance Consortium (BCSC) data set [2], the Ljubljana data set [17], and the Wisconsin data set [27]. A brief description of the data sets is given in Table 1. The BCSC data set contains 181,903 records, the Ljubljana data set contains 286 records, and the Wisconsin Breast Cancer Diagnostic (WBCD) data set contains 699 records. These data sets were specifically chosen to check how well the proposed model performs irrespective of data size.

Table 1

Details of Various Breast Cancer Data Sets Used for Feature Selection.

Header | BCSC | Ljubljana | WBCD
Data set characteristics | Multivariate | Multivariate | Multivariate
Attribute characteristics | Categorical | Categorical | Categorical
Associated tasks | Classification and prediction | Classification and prediction | Classification and prediction
Number of instances | 181,903 | 286 | 699
Number of attributes | 15+1 | 9+1 | 10+1
Missing values? | No | Yes | Yes
Binary class attribute | Cancer (values yes or no) | Recurrence (values no-recurrence-events or recurrence-events) | Class (values benign or malignant)
Other attributes | 1. Menopause, 2. Age group, 3. Density, 4. Race, 5. Hispanic, 6. BMI, 7. Age first, 8. Nrelbc, 9. BrstProc, 10. LastMamo, 11. SurgMeno, 12. HRT, 13. Invasive, 14. Training, 15. Count | 1. Age, 2. Menopause, 3. Tumor size, 4. Inv nodes, 5. Node caps, 6. Degree of malignancy, 7. Breast, 8. Breast quad, 9. Irradiates | 1. Sample code, 2. Clump thickness, 3. Uniformity of cell size, 4. Uniformity of cell shape, 5. Marginal adhesion, 6. Single epithelial cell size, 7. Bare nuclei, 8. Bland chromatin, 9. Normal nucleoli, 10. Mitoses

4.1 Model Parameters

For the entire experimental work, different parameters were set up, and they remain the same for all data sets. In order to obtain the optimal number of interesting rules, the minimum Support was set to 0.49 (49%) and the minimum Confidence to 0.75 (75%). We restrict the generation of frequent item sets to 10 iterations as the stopping criterion. After checking various values, the SVM parameters were set as follows to provide better results.

The gamma value "γ" is set to 1; the penalty parameter "c", also known as the box constraint, is set to 0; the degree "d" of the polynomial function is set to the default value 3; and finally, the tolerance value "t" is fixed at 0.5.

As mentioned in the previous section, our methodology works in five phases, and the results obtained in each phase are given in Section 4.2.

4.2 Results

4.2.1 Result of Step 1: Finding Frequent Item Sets Using the Apriori Algorithm

The frequent item sets for all data sets are found using the Apriori algorithm with the predefined parameters. It should be noted that, before generating frequent item sets, all records containing missing values were removed. Table 2 gives the details of the frequent item sets generated for the given breast cancer data sets.

Table 2

Frequent Item Sets Generated for All Three Breast Cancer Data Sets Used for Feature Selection.

Size of frequent item sets | BCSC | Ljubljana | WBCD
Size of L1 item sets | 6 | 6 | 6
Size of L2 item sets | 7 | 6 | 6
Size of L3 item sets | 4 | 4 | 5
Size of L4 item sets | 4 | 1 | 2
Size of L5 item sets | 1 | None | None
Size of L6 item sets | 1 | None | None

Further, Table 3 shows the number of general association rules generated for the given breast cancer data sets.

Table 3

Number of Rules Generated for All Three Breast Cancer Data Sets.

Data set | No. of general association rules generated
BCSC | 4619
Ljubljana | 39
WBCD | 112

4.2.2 Result of Step 2: Generating Association Rules for Feature Selection

With the help of the Apriori algorithm, we obtain the general association rules tabulated in Table 3.

4.2.3 Result of Step 3: Pruning the Rules by Restricting Them to Satisfy Various Constraints

In this step, the results obtained by applying syntax constraints and dimensionality constraints to all three breast cancer data sets, together with the significance of these results, are presented. Table 4 gives a summary of the general association rules generated.

Table 4

Number of General Association Rules Obtained for All Three Breast Cancer Data Sets Used for Feature Selection.

Rules | BCSC | Ljubljana | WBCD
General association rules generated | 4619 | 39 | 112
Examples covered by selected rules | 500 | 270 | 602
Matching-mismatching examples | 500-0 | 239-31 | 502-100

Further, Tables 5 and 6 give us the number of reduced association rules obtained on applying syntax-based constraints and dimensionality-based constraints, respectively.

Table 5

Number of Syntax-based Reduced Association Rules Obtained for All Three Breast Cancer Data Sets Used for Feature Selection.

Rules | BCSC | Ljubljana | WBCD
Reduced rules on applying syntax constraint | 63 | 5 | 18
Examples covered by selected rules | 391 | 201 | 444
Matching-mismatching examples | 391-0 | 189-12 | 402-02
Percentage of rules reduced | 98.635 | 87.179 | 83.392
Table 6

Number of Dimensionality Based Reduced Association Rules Obtained for All Three Breast Cancer Data sets Used for Feature Selection.

Rules | BCSC | Ljubljana | WBCD
Reduced rules on applying dimensionality constraint | 6 | 3 | 7
Examples covered by selected rules | 391 | 201 | 444
Matching-mismatching examples | 391-0 | 189-12 | 422-02
Percentage of rules reduced | 99.87 | 92.307 | 93.750

From Table 5 it is clear that, after applying the syntax-based constraints, the number of rules is reduced by 98.63% in the BCSC data set (i.e. just 1.36% of the rules are enough to carry out the required task), by 87.17% in the Ljubljana data set, and by 83.39% in the WBCD data set, while still covering more than three fourths of the examples in the data sets and keeping the accuracy of the system consistent.

Further, when the dimensionality constraints were applied, as shown in Table 6, the number of rules was reduced by 99.87% in the BCSC data set, by 92.30% in the Ljubljana data set, and by 93.75% in the WBCD data set; however, the number of examples covered remained the same. This large reduction in the number of rules lowers the computation time and increases the efficiency of the model.

4.2.4 Result of Step 4: Selecting Top N Attributes from the Reduced Rules Based upon Various Interestingness Measures

Attributes selected using the interestingness measures for the breast cancer data sets are listed in Table 7. Table 7 also provides a comparative view of the proposed method with various other feature selection techniques.

Table 7

Selection of Top N Attributes from Various Feature Selection Methods.

Attribute selection method | BCSC | Ljubljana | Wisconsin
Relief | 13, 3, 2, 1, 6, and 11 | 4, 5, 6, 3, and 9 | 4, 3, 7, 8, and 5
Information gain | 3, 13, 2, 1, 6, and 4 | 6, 4, 3, 5, and 9 | 4, 3, 7, 8, and 6
Gain ratio | 13, 2, 1, 3, 6, and 4 | 5, 4, 6, 9, and 3 | 7, 3, 4, 9, and 6
Gini index | 13, 3, 2, 1, 6, and 4 | 6, 3, 4, 5, and 9 | 3, 4, 7, 6, and 8
Proposed method | 3, 1, 2, 11, 12, and 13 | 4, 5, and 9 | 3, 4, 6, 9, and 7

4.2.5 Result of Step 5: Validating the Selected Feature Subset

The applicability of the selected attributes obtained from the proposed method was validated in terms of classification accuracy and time complexity. The accuracy of the proposed model was checked by using the upper and lower bound-based support vector machine (also known as v-SVM) with the subsets of attributes selected by various popularly used attribute selection techniques. Figures 2–4 give a comparative analysis of the performance of the attributes selected by our method and by other popular methods in a simple and precise way. The three kernel functions – linear, polynomial, and radial basis function (RBF) – were chosen, and the performance of each function was validated using 10-fold cross validation.

Figure 2: Validating the Performance of BCSC Feature Subset.

Figure 3: Validating the Performance of Ljubljana Feature Subset.

Figure 4: Validating the Performance of WBCD Feature Subset.

The BCSC feature subset obtained from the proposed work gives the best performance, higher than all of its counterparts. Since the BCSC data set is very large, it is divided into chunks of 500 records each to validate it using the v-SVM. The final result obtained by the proposed method for the linear kernel-based v-SVM is 93.4%, for the polynomial kernel-based v-SVM it is 81.4%, and for the RBF-based v-SVM it is 99.8%.

The Ljubljana feature subset obtained from the proposed work competes closely with the other techniques, and the performance of the RBF-based v-SVM is found to be the highest. The accuracy of the proposed method for the linear kernel-based v-SVM is 72.03%, for the polynomial kernel-based v-SVM it is 71.34%, and for the RBF-based v-SVM it is 91.27%.

The WBCD feature subset obtained from the proposed work gives almost the same result as the information gain method, and here too the performance of the RBF-based v-SVM is found to be the highest. The accuracy of the proposed method for the linear kernel-based v-SVM is 72.03%, for the polynomial kernel-based v-SVM it is 71.34%, and for the RBF-based v-SVM it is 91.27%. However, there is one exception: the Relief-based features perform slightly better than all the other methods with the polynomial kernel-based v-SVM.

Time complexity is one of the major factors deciding the efficacy of the proposed model. In order to find out whether our proposed model is comparatively better than the other models, a comparative analysis of the time complexities of Relief, information gain, gain ratio, Gini index, and the proposed method is performed; the results are tabulated in Table 8.

Table 8

Comparison of Time Complexities of Various Methods Used for Feature Selection.

Method | Complexity
Relief | O(m * n * f)
Information gain | O(f * f)
Gain ratio | O(n log2(n))
Gini index | O(n * f)
Proposed method | O(no. of association rules generated + size of L)

The complexity of an algorithm depends upon various factors, such as the size of the input, type of data structure used for representation, i.e. whether it is an array or tree, and whether the data are considered to be sorted or not. In order to compare the time complexities of various methods, it is assumed that the data are sorted and are represented using a tree structure.

For a given data set containing n training instances (or transactions), with f number of attributes (features) and m iterations, the time complexity of various techniques used is represented as follows.

The complexity of Relief is O(m * n * f) [25]. The complexity of information gain is O(f * f) [16]. The complexity of gain ratio is O(n log2(n)) [26]. The complexity of the Gini index is O(n * f) [29]. The complexity of association rule generation is O(no. of association rules generated + size of L) [7], where L is the lattice of large item sets.

Since association rule generation has a complexity of O(no. of association rules generated + size of L), the complexity of our proposed model is reduced by more than 50%. This is because the rules generated after applying the syntax-based and dimensionality-based constraints are far fewer than the original rules obtained without constraints. Thus, the proposed model is less complex when compared to the other models.

5 Conclusion and Future Work

The proposed method serves the need of selecting a strong feature subset from a given data set, irrespective of the type of data. By using constraint-governed ARM, not only are the association rules pruned, but the constraints are also made to serve the need of finding the relevant subset of features. The relevancy of the features is strongly supported by interestingness measures such as coverage, leverage, and utility.

From the experimental results, it is clear that the proposed method produces better results when subjected to different SVM kernels, the exception being the polynomial kernel (for which gain ratio showed better results on the Ljubljana data set). From all the obtained results, it can be concluded that the features selected by ARM governed by syntax and dimensionality constraints, along with the interestingness measures, yield comparatively better results than their counterparts. The only other exception is the Relief-based features, which outperform their counterparts with the polynomial kernel-based v-SVM. Nonetheless, on the whole, the proposed work provides an assurance that the selected feature subset reduces the feature space by more than 50% and, at the same time, produces good accuracy.

Apart from accuracy, time complexity also plays a major role. From Table 8, it can be said that the proposed method is much better in terms of time complexity, as the number of rules generated through the proposed method is reduced by far more than three fourths compared with the general method of ARM, which is a significant change. Thus, it can be concluded that the proposed model is better at reducing the feature space required for processing activities like classification and prediction.

In the future, work will be carried out to overcome the exceptions of the proposed method by improving the parameter settings and/or the selection criteria. Apart from working on categorical data, the future vision is to select strong and relevant attributes from numerical data. Quantitative ARM, which can deal with large quantitative data, can be used for feature selection from numerical data sets.


Corresponding author: M. Umme Salma, Department of Computer Science, Mangalore University, Mangalagangothri, Mangalore, 574199, India, e-mail:

Acknowledgments

BCSC: Breast cancer data collection and sharing were supported by the National Cancer Institute-funded Breast Cancer Surveillance Consortium. A list of the BCSC investigators and procedures for requesting BCSC data for research purposes is provided at http://breastscreening.cancer.gov/. Ljubljana Hospital and its team: the Ljubljana Breast Cancer Data were obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Slovenia. Thanks go to M. Zwitter and M. Soklic for providing the data. Wisconsin Hospital and its team: the Wisconsin Original Breast Cancer data set was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. A hearty thanks to the institutes and personnel, my teachers, and my fellow researchers for their timely support. A special thanks to the anonymous reviewers, editor, and proofreading team, whose suggestions turned a raw draft into a fine paper.

Funding: Maulana Azad National Fellowship for Minority Students (Grant/Award Number: 'F1-17/2013-14/MANF-2013-14-MUS-KAR-24350').

Bibliography

[1] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), volume 1215, pp. 487–499, 1994.

[2] Breast Cancer Surveillance Consortium. http://breastscreening.cancer.gov/rfdataset/. Accessed 10 February 2013.

[3] H.-L. Chen, B. Yang, J. Liu and D.-Y. Liu, A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis, Expert Syst. Appl. 38 (2011), 9014–9022. doi:10.1016/j.eswa.2011.01.120.

[4] B. Dash, D. Mishra, A. Rath and M. Acharya, A hybridized K-means clustering approach for high dimensional dataset, Int. J. Eng. Sci. Technol. 2 (2010), 59–66. doi:10.4314/ijest.v2i2.59139.

[5] J. Demšar, T. Curk, A. Erjavec, Č. Gorup, T. Hočevar, M. Milutinovič, M. Možina, M. Polajnar, M. Toplak, A. Starič, M. Štajdohar, L. Umek, L. Žagar, J. Žbontar, M. Žitnik and B. Zupan, Orange: data mining toolbox in Python, J. Mach. Learn. Res. 14 (2013), 2349–2353.

[6] J. L. Gastwirth, The estimation of the Lorenz curve and Gini index, Rev. Econ. Stat. 54 (1972), 306–316. doi:10.2307/1937992.

[7] M. L. Gavrilova and M. Gavrilova, Computational Science and Its Applications - ICCSA 2006, Pt. 4: International Conference, Glasgow, UK, May 8–11, 2006, Proceedings, volume 4, Springer Science & Business Media, 2006.

[8] L. Geng and H. J. Hamilton, Interestingness measures for data mining: a survey, ACM Comput. Surv. 38 (2006), 9. doi:10.1145/1132960.1132963.

[9] S. Gunal and R. Edizkan, Subspace based feature selection for pattern recognition, Inform. Sciences 178 (2008), 3716–3726. doi:10.1016/j.ins.2008.06.001.

[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Southeast Asia edition, Morgan Kaufmann, 2006.

[11] O. Inan, M. S. Uzer and N. Ylmaz, A new hybrid feature selection method based on association rules and PCA for detection of breast cancer, Int. J. Innovative Comput. Inform. Control 9 (2013), 727–729.

[12] N. Jiang and L. Gruenwald, Research issues in data stream association rule mining, ACM SIGMOD Record 35 (2006), 14–19. doi:10.1145/1121995.1121998.

[13] A. Khemphila and V. Boonjing, Heart disease classification using neural network and feature selection, in: Systems Engineering (ICSEng), 2011 21st International Conference on, pp. 406–409, IEEE, 2011. doi:10.1109/ICSEng.2011.80.

[14] K. Kira and L. A. Rendell, The feature selection problem: traditional methods and a new algorithm, in: AAAI, volume 2, pp. 129–134, 1992.

[15] J. Liang, F. Wang, C. Dang and Y. Qian, An efficient rough feature selection algorithm with a multi-granulation view, Int. J. Approx. Reason. 53 (2012), 912–926. doi:10.1016/j.ijar.2012.02.004.

[16] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Springer Science & Business Media, 1998. doi:10.1007/978-1-4615-5689-3.

[17] Ljubljana Breast Cancer Dataset. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer. Accessed 10 February 2013.

[18] S. Lu, Y. Ye, R. Tsui, H. Su, R. Rexit, S. Wesaratchakit, X. Liu and R. Hwa, Domain ontology-based feature reduction for high dimensional drug data and its application to 30-day heart failure readmission prediction, in: Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2013 9th International Conference on, pp. 478–484, IEEE, 2013. doi:10.4108/icst.collaboratecom.2013.254124.

[19] Orange Software Open Source. http://orange.biolab.si/. Accessed 10 February 2013.

[20] O. O. Odusanya and O. O. Tayo, Breast cancer knowledge, attitudes and practice among nurses in Lagos, Nigeria, Acta Oncol. 40 (2001), 844–848. doi:10.1080/02841860152703472.

[21] H. Park, S. Kwon and H.-C. Kwon, Complete Gini-index text (GIT) feature-selection algorithm for text classification, in: Software Engineering and Data Mining (SEDM), 2010 2nd International Conference on, pp. 366–371, IEEE, 2010.

[22] G. Piatetsky-Shapiro, Discovery, analysis and presentation of strong rules, in: Knowledge Discovery in Databases, pp. 229–238, 1991.

[23] J. Rong, H. Q. Vu, R. Law and G. Li, A behavioral analysis of web sharers and browsers in Hong Kong using targeted association rule mining, Tourism Manage. 33 (2012), 731–740. doi:10.1016/j.tourman.2011.08.006.

[24] A. Unler and A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems, Eur. J. Oper. Res. 206 (2010), 528–539. doi:10.1016/j.ejor.2010.02.032.

[25] J. W., G. G. Yen and M. M. Polycarpou, Advances in Neural Networks - ISNN 2012: 9th International Symposium on Neural Networks, ISNN 2012, Shenyang, China, July 11–14, 2012, Proceedings, Springer, 2012.

[26] S. Wang, G. Yu and H. Lu, Advances in Web-Age Information Management: Second International Conference, WAIM 2001, Xi'an, China, July 9–11, 2001, Proceedings, volume 2, Springer Science & Business Media, 2001. doi:10.1007/3-540-47714-4.

[27] Wisconsin Breast Cancer Diagnostic Dataset. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). Accessed 10 February 2013.

[28] Y. Yao, Y. Chen and X. Yang, A measurement-theoretic foundation of rule interestingness evaluation, in: Foundations and Novel Approaches in Data Mining, pp. 41–59, Springer, Berlin Heidelberg, 2006. doi:10.1007/11539827_3.

[29] M. J. Zaki and C.-T. Ho, Large-Scale Parallel Data Mining, Number 1759, Springer Science & Business Media, 2000. doi:10.1007/3-540-46502-2.

[30] L.-X. Zhang, J.-X. Wang, Y.-N. Zhao and Z.-H. Yang, A novel hybrid feature selection algorithm: using ReliefF estimation for GA-wrapper search, in: Machine Learning and Cybernetics, 2003 International Conference on, volume 1, pp. 380–384, IEEE, 2003.

Received: 2014-12-31
Published Online: 2016-2-4
Published in Print: 2017-1-1

©2017 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
