Decision Support Systems

Volume 50, Issue 1, December 2010, Pages 93-102

The data complexity index to construct an efficient cross-validation method

https://doi.org/10.1016/j.dss.2010.07.005

Abstract

Cross-validation is a widely used model evaluation method in data mining applications. However, it usually takes considerable effort to determine appropriate parameter values, such as the training data size and the number of experiment runs, to implement a valid evaluation. This study develops an efficient cross-validation method, called Complexity-based Efficient (CBE) cross-validation, for binary classification problems. CBE cross-validation establishes a complexity index, called the CBE index, by exploring the geometric structure and noise of the data. The CBE index is used to calculate the optimal training data size and the number of experiment runs, reducing model evaluation time for computationally expensive classification data sets. One simulated and three real data sets are employed to validate the performance of the proposed method, with repeated random sub-sampling validation and K-fold cross-validation as the comparison methods. The results show that CBE cross-validation, repeated random sub-sampling validation and K-fold cross-validation achieve similar validation performance, while the training time required for CBE cross-validation is lower than that of the other two methods.

Introduction

In data mining applications, researchers generally use cross-validation to evaluate the learned classification model [11]. However, this usually incurs considerable computational cost. With K-fold cross-validation, for example, the number of experiment runs must increase as the parameter K increases, making the training computationally expensive [1]. Specifically, a fraction (K-1)/K of the data is theoretically needed to learn the classification model in each run, and when the data set is very large, training on (K-1)/K of the data makes computation expensive [1].
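The K-fold scheme described above can be sketched with the standard library alone; the function name and seed are our own choices for illustration, not from the paper:

```python
import random

def k_fold_splits(data, k, seed=0):
    """Partition data into k folds; each fold serves once as the
    validation set while the remaining (k-1)/k of the data trains."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

# With k = 5, each training set holds (5-1)/5 = 80% of the data,
# so the model must be trained five times on 80-point sets.
for training, validation in k_fold_splits(range(100), k=5):
    assert len(training) == 80 and len(validation) == 20
```

As the sketch makes visible, raising K both enlarges each training set toward the full data size and adds more runs, which is exactly the cost the paper targets.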

In another common scenario, repeated random sub-sampling validation is typically repeated 30 or 50 times for model evaluation [23]. However, if the data structure is simple or uniform, the number of repetitions is far greater than what is actually needed, making the procedure inefficient.
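For comparison, repeated random sub-sampling validation draws a fresh train/validation split on every run; the 70/30 proportion below is an assumed illustration, since the split ratio (unlike K-fold's) is independent of the number of runs:

```python
import random

def repeated_subsampling(data, train_fraction=0.7, runs=30, seed=0):
    """Yield `runs` independent random train/validation splits.
    The split proportion does not depend on the number of runs."""
    rng = random.Random(seed)
    items = list(data)
    n_train = round(train_fraction * len(items))
    for _ in range(runs):
        rng.shuffle(items)            # fresh random split each run
        yield items[:n_train], items[n_train:]

# 30 repetitions of a 70/30 split, as in the common setting cited above.
for training, validation in repeated_subsampling(range(100), 0.7, runs=30):
    assert len(training) == 70 and len(validation) == 30
```

The fixed `runs` count is precisely the waste the paper points at: for a simple data structure, far fewer repetitions would already give a stable estimate.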

Our research develops an effective cross-validation procedure, called Complexity-based Efficient (CBE) cross-validation, for binary classification problems. The CBE cross-validation method can be used to calculate the optimal training data size and the number of experiment runs to reduce model validation time. The CBE cross-validation procedure systematically establishes a non-linear data complexity index (defined in Section 3) called CBE index by exploring the geometric structure and noise of data.

The density-based clustering algorithm DBSCAN is used to discover the geometric structure and noise, while the between-distance and within-distance of the clusters found serve as the factors of the CBE index. Based on this index, the research develops an efficient CBE cross-validation procedure to calculate the optimal training data size and number of experiment runs.
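One plausible reading of the between-distance and within-distance factors is sketched below; the snippet defines them via cluster centroids, which is our own assumption since the paper's exact definitions are not shown in this excerpt:

```python
from statistics import mean

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return tuple(mean(coord) for coord in zip(*cluster))

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def within_distance(cluster):
    """Average distance from each point to its cluster centroid:
    small values mean a compact (easy) cluster."""
    c = centroid(cluster)
    return mean(dist(p, c) for p in cluster)

def between_distance(cluster_a, cluster_b):
    """Distance between the two cluster centroids: large values
    mean well-separated (easy) classes."""
    return dist(centroid(cluster_a), centroid(cluster_b))

a = [(0.0, 0.0), (0.0, 2.0)]    # centroid (0, 1), within-distance 1
b = [(10.0, 0.0), (10.0, 2.0)]  # centroid (10, 1)
assert within_distance(a) == 1.0
assert between_distance(a, b) == 10.0
```

Intuitively, a complexity index built on such factors grows when clusters are diffuse (large within-distance) and overlapping (small between-distance).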

The rest of this paper is organized as follows: The literature review is given in Section 2 while the detailed procedure of the proposed method is described in Section 3. One simulated and three real data sets are used to illustrate the CBE cross-validation model in Section 4, and Section 5 contains the conclusion and discussion of our research.


Literature review

In this section we review the concept of linear data complexity (the definition is explained in Section 3), the geometric structure and noise of data, and existing cross-validation methods.

Proposed method

For binary classification problems, data complexity is defined as the level of difficulty in separating the data into classes: when the data complexity is high, the data are hard to classify. Complexity can be subdivided into linear and non-linear cases: linear data complexity measures how difficult it is to separate the data with a linear hyperplane, while non-linear data complexity measures how difficult it is to separate the data with a non-linear hyperplane. Taking the XOR problem as an
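The XOR problem mentioned above is the standard example of non-linear complexity: no single line separates its two classes. A quick brute-force check over a coarse grid of candidate lines (our own illustration, not the paper's procedure) makes this concrete:

```python
# XOR labels: class 1 iff exactly one coordinate is 1. AND is the
# contrasting, linearly separable case.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def linearly_separable(data):
    """Try every line w1*x + w2*y >= b on a coarse grid; return True if
    some line puts all class-1 points on one side and class-0 on the other."""
    grid = [i / 4 - 3 for i in range(25)]   # -3.0 .. 3.0 in steps of 0.25
    return any(
        all((w1 * x + w2 * y >= b) == (label == 1)
            for (x, y), label in data.items())
        for w1 in grid for w2 in grid for b in grid
    )

assert linearly_separable(AND)        # e.g. x + y >= 2 works
assert not linearly_separable(XOR)    # no linear hyperplane works
```

(The grid search only illustrates the point; for XOR the impossibility also follows directly, since w2 >= b, w1 >= b and b > 0 would force w1 + w2 >= 2b > b, contradicting the requirement that (1, 1) fall below the line.)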

Experiment

In this section, we use one simulated and three real data sets to verify the performance of the Complexity-based Efficient (CBE) cross-validation method. In the simulation experiments, a support vector machine (SVM) [12], a Back-propagation Network (BPN) [8], [20], and a Naive Bayes Classifier (NBC) [24] are used as the classification tools, while in the three real data sets, only SVM is used.

To find the relationship between CBE index and classification accuracy, we randomly select 10% of the

Conclusion and discussions

Our research develops an efficient and effective cross-validation method called Complexity-based Efficient (CBE) cross-validation. The CBE cross-validation uses the CBE index (calculated by exploring the data's geometric structure and noise) to precisely discover the data's characteristics and its non-linear complexity, in order to help understand the data set. We also employ the CBE index to calculate the optimal training data size and number of experiment runs. CBE cross-validation aims to

Der-Chiang Li is a Distinguished Professor in the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. He received his Ph.D. degree at the Department of Industrial Engineering at Lamar University Beaumont, Texas, USA, in 1985. As a research professor, his current interest concentrates on learning with small data sets.

References (24)

  • C.M. Bishop
  • G. Casella et al.

Cited by (14)

    • Weighted fuzzy interpolative reasoning for sparse fuzzy rule-based systems based on piecewise fuzzy entropies of fuzzy sets

      2016, Information Sciences
      Citation Excerpt :

      It should be noted that Govindarajan and Chandrasekaran [32] have pointed out that the advantage of the repeated random sub-sampling cross-validation method over the k-fold cross-validation method [4] is that the proportion of the training/validation split is not dependent on the number of iterations (folds). Therefore, in this paper, we adopt the repeated random sub-sampling cross-validation method [49] for the experiments of the multivariate regression problems [29], the Mackey–Glass chaotic time series prediction problem [28] and the time series prediction problems [6,40], where we let α = 0.1 and let T = 50. In the following, we apply the proposed weighted fuzzy interpolative reasoning method to deal with the multivariate regression problems, including the abalone problem [29], the concrete compressive strength problem [29] and the concrete slump test problem [29].

    • Predictive modelling of survival and length of stay in critically ill patients using sequential organ failure scores

      2015, Artificial Intelligence in Medicine
      Citation Excerpt :

      Additionally, the models are tested on retrospective data as if it were live patient data, taking into account patient data not only from the first five days, but from the previous five to make moving window predictions. For training the different machine learning models, repeated random sub-sampling validation (RRSSV) [26], in which the dataset is split n times in a training set (60%) and a validation set (40%), is used. Over these n splits, the average or median of the measured values (e.g., median offset or average recall) is computed.

    • Generating information for small data sets with a multi-modal distribution

      2014, Decision Support Systems
      Citation Excerpt :

      Companies can gain a competitive advantage by speedily providing new products, but when these are in the pilot run stage there is generally only a small amount of data that can be used to improve their performance, due to financial and time limitations. It is thus important to develop analysis methods for use with small data sets, in order to achieve better classification performance [19,20,23,25,28]. Many approaches have been proposed to deal with this issue, with, for example Das and Nenadic [8] and Xu et al. [33] creating algorithms for certain data sets.

    • Improving learning accuracy by using synthetic samples for small datasets with non-linear attribute dependency

      2014, Decision Support Systems
      Citation Excerpt :

      This paper sets the training data size NT forward in order as 20, 40, 60, 80, 100, 125, and 150 in the UCI datasets. The determination of virtual sample size is based on Li et al. [12], which stated that too many virtual samples would decrease the learning accuracy. In order to verify the improved predictive accuracy achieved with small dataset learning in this work, the related experiment is carried out by the following three steps:

    • Using structure-based data transformation method to improve prediction accuracies for small data sets

      2012, Decision Support Systems
      Citation Excerpt :

      If the ε-neighborhood of a data point contains other data which has a data size that is more than a certain pre-defined number (Minpts), a cluster with this data (called the core object) is created; otherwise, the data is treated as noise which will be eventually deleted. DBSCAN iteratively collects directly density-reachable data (data within the ε-neighborhood of a core object) until no new data can be added to any cluster, and this may involve merging some items [19]. In this study, we apply the DBSCAN algorithm to cluster overall data sets to detect the data structures and noise.
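The DBSCAN mechanics described in this excerpt (ε-neighborhood, Minpts, core objects, density-reachable expansion, noise) can be sketched minimally as follows; this is a simplified illustration, not the implementation used in the cited study, and the neighborhood here counts the point itself:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: a point whose eps-neighborhood holds at
    least min_pts points (a core object) seeds a cluster; directly
    density-reachable points are collected iteratively; points that
    end up in no cluster are labeled noise (-1)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps]

    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        labels[i] = cluster         # i is a core object: start a cluster
        queue = list(seeds)
        while queue:                # collect density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:
                queue.extend(more)  # j is also a core object; keep expanding
        cluster += 1
    return labels

# Two dense groups plus one isolated outlier, which becomes noise.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
assert dbscan(pts, eps=2, min_pts=2) == [0, 0, 0, 1, 1, 1, -1]
```

The merging behavior the excerpt mentions corresponds to the expansion loop absorbing a point that an earlier pass had provisionally marked as noise.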



Yao-Hwei Fang is a postdoctoral fellow in the Division of Biostatistics and Bioinformatics, National Health Research Institutes, where he works in a laboratory for the statistical analysis of human genetics. He received his Ph.D. at the Department of Industrial and Information Management at National Cheng Kung University, Taiwan, in 2009.

Y.M. Frank Fang obtained his Ph.D. degree from the Department of Civil and Hydraulic Engineering, Feng Chia University (FCU), in 2006. Before joining that department in 2006, he worked as a postdoctoral researcher in the Geographic Information Systems Research Center, Feng Chia University. Currently, Assistant Professor Fang is Chief Researcher of the Geographic Information Systems Research Center, FCU. His research interests include disaster monitoring and civil engineering.
