
Applied Soft Computing

Volume 69, August 2018, Pages 704-718

Data analysis framework of sequential clustering and classification using non-dominated sorting genetic algorithm

https://doi.org/10.1016/j.asoc.2017.12.019

Highlights

  • Performs clustering and classification sequentially on two different datasets for preliminary data analysis.

  • Reveals the hidden structure of the dataset representing the main findings or performance measures, and derives data labels from the clustering task.

  • Exploits the correlation between the revealed labels and the features of another dataset through the classification task.

  • Develops a novel chromosome setting to perform feature selection for both clustering and classification.

  • Balances the solution quality of clustering and classification simultaneously via the Non-Dominated Sorting Genetic Algorithm.

Abstract

This research proposes an innovative framework that can be used as a preliminary data analysis tool when labels of data instances are not available during the early stage of the process. The preliminary data analysis usually starts from exploring “target interest” features, which can be the measures representing the performances or the decision attributes. Then, investigating the factors that are highly correlated with the “target interest” features is the major analysis task. Because no exact labels are provided, these data exploration and investigation processes are iterative and time-consuming, especially when the size of the data is huge. This research proposes a framework, named NSGAII-SCC, that formulates the multi-objective problem of sequentially combining clustering for “target interest” exploration with a classification algorithm for factor investigation. The fast and elitist non-dominated sorting genetic algorithm (NSGA-II), integrated with a feature selection mechanism, is designed to search for better solutions for clustering and classification. This sequential clustering and classification process aims not only to reveal the hidden patterns of “target interest” but also to explore the features that are highly correlated with the discovered patterns. Two public transactional datasets from Kaggle were used to evaluate the performance of NSGAII-SCC. The experimental results show that NSGAII-SCC achieves promising performance in finding better solutions that balance the multiple objectives of clustering and classification. Additionally, feature selection via the chromosome settings can help to search for the relevant features for both clustering and classification learning. The proposed framework is particularly useful as a tool to investigate big transactional data.

Introduction

Data clustering and classification are two major data mining techniques, which have been applied in many areas for decades. Data clustering, considered an unsupervised learning method, aims to identify groups of data objects based on similarity measures [1]. Owing to its ability to discover latent patterns and correlations among data points, data clustering has been applied in various real-life fields such as image processing, statistics, biology, business, and social science to investigate hidden relationships in data. By contrast, data classification aims to classify a data object with an unknown label into a pre-determined group, which consists of a set of pre-classified objects with similar features [1]. Essentially, classifiers are trained with a labeled dataset where the concerned identification of each data point has been specified. Therefore, data classification is referred to as a supervised learning method and has been widely used in pattern recognition, machine learning, and artificial intelligence domains with multiple applications.

Although data clustering and classification methods can be applied individually, they are often utilized together for data exploration or data analytics projects, especially when data labels have not been defined. Assume a data analyst obtains a set of data D, which has n data points with p data attributes (D is an n × p matrix) for data analytics. Within these p attributes, the dependent and independent variables should be identified based on the objective of the analytics. During the initial stage of the data exploration, the analyst might only have a very rough idea about which data features in D might be their “target interests.” Here, the “target interests” could be the measures representing the performances of an individual or a unit, or the decision attributes produced by humans or machines to show the operating results or findings. Sometimes, determining the “target interests” might be based solely on the analyst’s assumption of the possible outcomes related to data instances. At this stage, without given labels of data points for classification, the clustering method can be applied to the data attributes selected as “target interests” to explore the possible patterns among the data instances. The findings from the clustering results can then serve as “clues” for labeling the data points for subsequent data classification.

The selected “target interests” should be reviewed thoroughly to check whether any pattern exists among their attributes. In this research, the dataset Q, a subset of D (Q ⊂ D, Q is an n × q matrix) with a smaller number of attributes q, where q < p, is defined as the dataset containing the “target interest” attributes. Usually, Q consists of attributes selected from D to represent the dependent variables of the data analysis. The remaining data attributes in D, specified as a dataset X (X ⊂ D, X is an n × x matrix), where x = p − q, form the data domain of “factors” for data exploration such as multivariate analysis. Statistical analysis or machine learning modeling can be performed to check whether a correlation exists between the “factor” attributes in X and the found patterns of “target interests.” Additionally, both Q and X are subsets of D, but they do not overlap (Q ∩ X = ∅) because each of them represents the dependent and independent variables (or attributes), respectively, for data analysis.
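The column-wise split of D into Q and X described above can be sketched as follows. This is a minimal illustration assuming a numeric matrix; the column indices chosen as “target interests” are purely hypothetical:

```python
import numpy as np

# Hypothetical illustration of splitting D into Q ("target interests")
# and X ("factors"); the choice of target columns is an assumption.
rng = np.random.default_rng(0)
n, p = 6, 5                      # n data points, p attributes
D = rng.normal(size=(n, p))      # full dataset D (n x p matrix)

target_cols = [0, 1]             # q "target interest" attributes (assumed)
factor_cols = [c for c in range(p) if c not in target_cols]

Q = D[:, target_cols]            # Q is n x q: the dependent variables
X = D[:, factor_cols]            # X is n x x, with x = p - q

assert Q.shape == (n, len(target_cols))
assert X.shape == (n, p - len(target_cols))
assert set(target_cols).isdisjoint(factor_cols)   # Q and X do not overlap
```

The disjointness assertion mirrors the Q ∩ X = ∅ condition: every attribute of D ends up in exactly one of the two sub-datasets.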

For example, an education institute wants to investigate students’ learning performance and study which factors highly affect students’ learning. Since the “target interests” are the students’ performance, for each student, the grades for all homework, mid-term exams, final exams, quizzes, and projects of multiple subjects can be collected as a dataset Q to study the learning performance. Dataset Q can be extremely big, and the traditional GPA is too limited to specify the distinctive learning performances of the students. The clustering method can be applied to Q to partition groups of students with similar learning performances. The clustering result might be able to present the grade distributions of students, such as a group of students who have relatively higher grades on scientific subjects and lower grades on liberal studies. The clustering labels can be used to segment students with distinctive learning performances. The analytical process will then continue to study which factors might influence or highly correlate with the student’s “label” specified by the clustering. In this case, dataset X, containing each student’s information such as the average hours of studying, the aptitude test result, gender, age, and demographic measures, can be investigated to identify which factors correlate highly with the student’s performance. Feature selection and data classification methods can be applied to dataset X to investigate the attributes of X and even predict a student’s performance based on the selected attributes.

Utilizing clustering and classification on datasets Q and X, which are different datasets for the data analysis, is an iterative and time-consuming data exploration process, especially when the size of the data is huge. Multiple decisions must therefore be made, such as the number of clusters, which features in Q and X are significant for clustering and classification, respectively, and which clustering and classification methods should be used. The optimality of combining clustering on Q with classification on X can be defined as the result showing a high classification accuracy when determining which features in X can be used to explain or predict the discovered labels. Note that a good clustering result on Q does not necessarily lead to a good classification outcome when predicting the discovered labels L from X. Balancing the optimization of the clustering and classification for data exploration is the goal of this research.
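The balance described above is a Pareto trade-off: a candidate solution is kept only if no other candidate is at least as good on both objectives and strictly better on one. A minimal non-dominated filter (the first front of NSGA-II's sorting; the two objective values below are illustrative numbers, not results from the paper) can be sketched as:

```python
def dominates(a, b):
    """True if solution a Pareto-dominates b, with both objectives minimized
    (e.g., a clustering objective on Q and a classification error on X)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep the solutions not dominated by any other (NSGA-II's first front)."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]

# Each tuple: (clustering objective, classification error) -- made-up values.
sols = [(0.2, 0.30), (0.3, 0.10), (0.25, 0.25), (0.4, 0.05), (0.35, 0.20)]
front = pareto_front(sols)
# (0.35, 0.20) is dominated by (0.3, 0.10); the other four form the front.
```

This captures why no single "best" combination exists: solutions on the front each trade clustering quality on Q against classification accuracy on X.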

This research proposes a new data analysis framework, named NSGAII-SCC, which utilizes the fast and elitist non-dominated sorting genetic algorithm (NSGA-II) to combine clustering and classification algorithms sequentially to investigate Q and X. The novelty of this framework is that the clustering and classification are not conducted on the same dataset. In addition, the feature selection process embedded in the genetic chromosome setting is integrated within the clustering and classification to iteratively search for the significant features, improving the clustering and classification results simultaneously. Through an iterative process, NSGA-II is utilized to select candidate solutions of clustering and classification across generations. The multiple results on the Pareto front under NSGA-II imply possible correlations between Q and X during the preliminary data analysis.
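One plausible shape for the chromosome setting is a bitstring with one gene per feature of Q and of X, plus a gene for the number of clusters. The exact gene layout of NSGAII-SCC is not shown in this excerpt, so the encoding below (and the values of q, x, and the cluster-count bounds) is an assumption used only to illustrate joint feature selection:

```python
import random

# Assumed sizes: q = 4 "target interest" attributes, x = 6 factor attributes.
q, x = 4, 6
K_MIN, K_MAX = 2, 8   # assumed bounds on the number of clusters

def random_chromosome(rng=random):
    q_genes = [rng.randint(0, 1) for _ in range(q)]   # features of Q used in clustering
    x_genes = [rng.randint(0, 1) for _ in range(x)]   # features of X used in classification
    k_gene = rng.randint(K_MIN, K_MAX)                # number of clusters
    return q_genes + x_genes + [k_gene]

def decode(chrom):
    """Map a chromosome to (selected Q columns, selected X columns, k)."""
    q_sel = [i for i, g in enumerate(chrom[:q]) if g]
    x_sel = [i for i, g in enumerate(chrom[q:q + x]) if g]
    return q_sel, x_sel, chrom[-1]

chrom = random_chromosome(random.Random(7))
q_sel, x_sel, k = decode(chrom)
```

Under this encoding, crossover and mutation operate on a single string while simultaneously perturbing the clustering feature set, the classification feature set, and the cluster count.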

The remainder of this paper is organized as follows. Section 2 reviews the literature on multivariate analysis of Q and X, existing algorithms for combining clustering and classification, and multi-objective optimization with NSGA. Section 3 describes the proposed framework, NSGAII-SCC, in detail. Section 4 presents the experimental results, using public datasets from Kaggle, to evaluate the performance of multiple combinations of clustering and classification methods under the proposed framework. The proposed framework is also compared with other frameworks. Finally, Section 5 concludes with research remarks and discussion.


Literature review

At first glance, the traditional multivariate data analysis tools such as multivariate analysis of variance (MANOVA), multivariate analysis of covariance (MANCOVA) or Canonical Correlation Analysis (CCA) seem to fit well for analyzing the relationship between Q and X specified above [2]. The linear models such as MANOVA and MANCOVA have been used widely to study the potential factors or covariates with multiple dependent variables. However, the assumptions of multivariate normality, independent

Methodology

We describe the data clustering and classification mentioned above as two functions denoted as Ω and Φ, respectively. The clustering Ω takes dataset Q as the input and generates labels L as the output, formulated as L = Ω(Q). The classifier Φ then uses X and L to train a model, which is able to predict L by using X. If the predicted label is denoted by L̂, the classifier can be formed as L̂ = Φ(X|L), where Φ takes X as the input with the given L. Combining functions Ω and Φ, the formula L̂ = Φ(
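The sequential composition L = Ω(Q) followed by L̂ = Φ(X|L) can be sketched with minimal stand-ins: a tiny k-means for Ω and a nearest-centroid classifier for Φ. These are assumptions for illustration only, since the framework evaluates several clustering and classification methods; the synthetic two-blob data guarantees a clean recovery:

```python
import numpy as np

def omega(Q, k=2, iters=20):
    """Omega: a minimal k-means stand-in; clusters Q and returns labels L.
    Centers are initialized deterministically at evenly spaced rows."""
    centers = Q[np.linspace(0, len(Q) - 1, k).astype(int)]
    for _ in range(iters):
        L = np.argmin(((Q[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([Q[L == j].mean(axis=0) if np.any(L == j)
                            else centers[j] for j in range(k)])
    return L

def phi(X, L):
    """Phi: a nearest-centroid classifier trained on (X, L); predicts labels from X."""
    classes = np.unique(L)
    cents = np.array([X[L == c].mean(axis=0) for c in classes])
    return classes[np.argmin(((X[:, None] - cents[None]) ** 2).sum(-1), axis=1)]

# Two well-separated groups so clustering Q yields an unambiguous L.
rng = np.random.default_rng(1)
Q = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
X = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])

L = omega(Q)              # L = Omega(Q): labels discovered on Q
L_hat = phi(X, L)         # L-hat = Phi(X | L): labels predicted from X
accuracy = (L_hat == L).mean()
```

On this toy data the factors in X perfectly explain the labels discovered on Q, so accuracy is 1.0; in real data the gap between L̂ and L is exactly what the classification objective measures.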

Dataset and parameter setting

Two public datasets, Rossmann and Wal-Mart sales data, from Kaggle competition (https://www.kaggle.com), were used in the experiment to represent the aforementioned Q and X datasets. Both datasets contain large transactional sales records of multiple retailing stores, which can be collected from multiple point-of-sale systems. The Rossmann sales data contains 1,017,209 data points, which were collected from the daily sales of 1115 stores from January 2013 to July 2015. The Wal-Mart sales data

Conclusion

This research proposed a framework named NSGAII-SCC, which combines clustering and classification methods sequentially, to fulfill the need of performing the preliminary data analysis. The innovation of this research is to separate the dataset where the data label has not been defined for clustering and classification sequentially. When the number of attributes and the size of the dataset are large, this framework helps to guide the data analysis task by segmenting the data to be “target

Acknowledgements

The authors would like to thank the reviewers’ comments for improving this work. The authors also appreciate the financial support from the Ministry of Science and Technology of Taiwan, R.O.C. (Contract No.: NSC 101-2221-E-011-057 and 106-2221-E-011-106-MY3).

References (35)

  • K. Josien et al., Integrated use of fuzzy c-means and fuzzy KNN for GT part family and machine cell formation, Int. J. Prod. Res. (2000)

  • H.-J. Zeng et al., CBC: Clustering based text classification requiring minimal labeled data

  • N. Kaewchinporn, A combination of decision tree learning and clustering for data classification, International Joint Conference on Computer Science and Software Engineering (2011)

  • X.-Y. Zhang et al., Combination of classification and clustering results with label propagation, Signal Process. Lett. IEEE (2014)

  • L.F. Coletta et al., Combining classification and clustering for tweet sentiment analysis

  • T. Finley et al., Supervised k-means clustering (2008)

  • A. Mukhopadhyay et al., Multi-class clustering of cancer subtypes through SVM based ensemble of Pareto-optimal solutions for gene marker identification, PLoS One (2010)