Decision Support Systems

Volume 50, Issue 1, December 2010, Pages 93-102

The data complexity index to construct an efficient cross-validation method

https://doi.org/10.1016/j.dss.2010.07.005

Abstract

Cross-validation is a widely used model evaluation method in data mining applications. However, it usually takes considerable effort to determine appropriate parameter values, such as the training data size and the number of experiment runs, to implement a valid evaluation. This study develops an efficient cross-validation method, called Complexity-based Efficient (CBE) cross-validation, for binary classification problems. CBE cross-validation establishes a complexity index, called the CBE index, by exploring the geometric structure and noise of the data. The CBE index is used to calculate the optimal training data size and the number of experiment runs, reducing model evaluation time for computationally expensive classification data sets. One simulated and three real data sets are employed to validate the performance of the proposed method, with repeated random sub-sampling validation and K-fold cross-validation as the comparison methods. The results show that CBE cross-validation, repeated random sub-sampling validation and K-fold cross-validation achieve similar validation performance, while the training time required for CBE cross-validation is lower than that of the other two methods.

Introduction

In data mining applications, researchers generally use cross-validation to evaluate the learned classification model [11]. However, this usually incurs considerable computational cost. With K-fold cross-validation, for example, the number of experiment runs must increase as the parameter K increases, making the training computationally expensive [1]. Specifically, a fraction (K-1)/K of the data is theoretically needed to learn the classification model in each run, and when the data set is very large, training on (K-1)/K of the data makes computation expensive [1].
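The K-fold scheme described above can be sketched with the standard library alone; the function name and seed are our own choices for illustration, not from the paper:

```python
import random

def k_fold_splits(data, k, seed=0):
    """Partition data into k folds; each fold serves once as the
    validation set while the remaining (k-1)/k of the data trains."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

# With k = 5, each training set holds (5-1)/5 = 80% of the data,
# so the model must be trained five times on 80-point sets.
for training, validation in k_fold_splits(range(100), k=5):
    assert len(training) == 80 and len(validation) == 20
```

As the sketch makes visible, raising K both enlarges each training set toward the full data size and adds more runs, which is exactly the cost the paper targets.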

In another common scenario, repeated random sub-sampling validation is typically repeated 30 or 50 times for model evaluation [23]. However, if the data structure is simple or uniform, the number of repetitions is far greater than what is actually needed, making the procedure inefficient.
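For comparison, repeated random sub-sampling validation draws a fresh train/validation split on every run; the 70/30 proportion below is an assumed illustration, since the split ratio (unlike K-fold's) is independent of the number of runs:

```python
import random

def repeated_subsampling(data, train_fraction=0.7, runs=30, seed=0):
    """Yield `runs` independent random train/validation splits.
    The split proportion does not depend on the number of runs."""
    rng = random.Random(seed)
    items = list(data)
    n_train = round(train_fraction * len(items))
    for _ in range(runs):
        rng.shuffle(items)            # fresh random split each run
        yield items[:n_train], items[n_train:]

# 30 repetitions of a 70/30 split, as in the common setting cited above.
for training, validation in repeated_subsampling(range(100), 0.7, runs=30):
    assert len(training) == 70 and len(validation) == 30
```

The fixed `runs` count is precisely the waste the paper points at: for a simple data structure, far fewer repetitions would already give a stable estimate.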

Our research develops an effective cross-validation procedure, called Complexity-based Efficient (CBE) cross-validation, for binary classification problems. The CBE cross-validation method can be used to calculate the optimal training data size and the number of experiment runs to reduce model validation time. The CBE cross-validation procedure systematically establishes a non-linear data complexity index (defined in Section 3) called CBE index by exploring the geometric structure and noise of data.

The density-based clustering algorithm DBSCAN is used to discover the geometric structure and noise, while the between-distance and within-distance of the clusters found serve as the factors of the CBE index. Based on this index, the research develops an efficient CBE cross-validation procedure to calculate the optimal training data size and number of experiment runs.
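One plausible reading of the between-distance and within-distance factors is sketched below; the snippet defines them via cluster centroids, which is our own assumption since the paper's exact definitions are not shown in this excerpt:

```python
from statistics import mean

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return tuple(mean(coord) for coord in zip(*cluster))

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def within_distance(cluster):
    """Average distance from each point to its cluster centroid:
    small values mean a compact (easy) cluster."""
    c = centroid(cluster)
    return mean(dist(p, c) for p in cluster)

def between_distance(cluster_a, cluster_b):
    """Distance between the two cluster centroids: large values
    mean well-separated (easy) classes."""
    return dist(centroid(cluster_a), centroid(cluster_b))

a = [(0.0, 0.0), (0.0, 2.0)]    # centroid (0, 1), within-distance 1
b = [(10.0, 0.0), (10.0, 2.0)]  # centroid (10, 1)
assert within_distance(a) == 1.0
assert between_distance(a, b) == 10.0
```

Intuitively, a complexity index built on such factors grows when clusters are diffuse (large within-distance) and overlapping (small between-distance).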

The rest of this paper is organized as follows: The literature review is given in Section 2 while the detailed procedure of the proposed method is described in Section 3. One simulated and three real data sets are used to illustrate the CBE cross-validation model in Section 4, and Section 5 contains the conclusion and discussion of our research.


Literature review

In this section we review the concept of linear data complexity (the definition is explained in Section 3), the geometric structure and noise of data, and existing cross-validation methods.

Proposed method

For binary classification problems, data complexity is defined as the level of difficulty in separating the data into classes: when the data complexity is high, the data are hard to classify. Complexity can be subdivided into linear and non-linear cases: linear data complexity measures how difficult it is to separate the data with a linear hyperplane, while non-linear data complexity measures how difficult it is to separate the data with a non-linear hyperplane. Taking the XOR problem as an
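The XOR problem mentioned above is the standard example of non-linear complexity: no single line separates its two classes. A quick brute-force check over a coarse grid of candidate lines (our own illustration, not the paper's procedure) makes this concrete:

```python
# XOR labels: class 1 iff exactly one coordinate is 1. AND is the
# contrasting, linearly separable case.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def linearly_separable(data):
    """Try every line w1*x + w2*y >= b on a coarse grid; return True if
    some line puts all class-1 points on one side and class-0 on the other."""
    grid = [i / 4 - 3 for i in range(25)]   # -3.0 .. 3.0 in steps of 0.25
    return any(
        all((w1 * x + w2 * y >= b) == (label == 1)
            for (x, y), label in data.items())
        for w1 in grid for w2 in grid for b in grid
    )

assert linearly_separable(AND)        # e.g. x + y >= 2 works
assert not linearly_separable(XOR)    # no linear hyperplane works
```

(The grid search only illustrates the point; for XOR the impossibility also follows directly, since w2 >= b, w1 >= b and b > 0 would force w1 + w2 >= 2b > b, contradicting the requirement that (1, 1) fall below the line.)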

Experiment

In this section, we use one simulated and three real data sets to verify the performance of the Complexity-based Efficient (CBE) cross-validation method. In the simulation experiments, a support vector machine (SVM) [12], a Back-propagation Network (BPN) [8], [20], and a Naive Bayes Classifier (NBC) [24] are used as the classification tools, while in the three real data sets, only SVM is used.

To find the relationship between CBE index and classification accuracy, we randomly select 10% of the

Conclusion and discussions

Our research develops an efficient and effective cross-validation method called Complexity-based Efficient (CBE) cross-validation. The CBE cross-validation uses the CBE index (calculated by exploring the data's geometric structure and noise) to precisely discover the data's characteristics and its non-linear complexity, in order to help understand the data set. We also employ the CBE index to calculate the optimal training data size and number of experiment runs. CBE cross-validation aims to

Der-Chiang Li is a Distinguished Professor in the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. He received his Ph.D. degree at the Department of Industrial Engineering at Lamar University Beaumont, Texas, USA, in 1985. As a research professor, his current interest concentrates on learning with small data sets.

References (24)

  • C.M. Bishop
  • G. Casella et al.

Cited by (14)

    • Weighted fuzzy interpolative reasoning for sparse fuzzy rule-based systems based on piecewise fuzzy entropies of fuzzy sets

      2016, Information Sciences
      Citation Excerpt :

      It should be noted that Govindarajan and Chandrasekaran [32] have pointed out that the advantage of the repeated random sub-sampling cross-validation method over the k-fold cross-validation method [4] is that the proportion of the training/validation split is not dependent on the number of iterations (folds). Therefore, in this paper, we adopt the repeated random sub-sampling cross-validation method [49] for the experiments of the multivariate regression problems [29], the Mackey–Glass chaotic time series prediction problem [28] and the time series prediction problems [6,40], where we let α = 0.1 and let T = 50. In the following, we apply the proposed weighted fuzzy interpolative reasoning method to deal with the multivariate regression problems, including the abalone problem [29], the concrete compressive strength problem [29] and the concrete slump test problem [29].

    • Predictive modelling of survival and length of stay in critically ill patients using sequential organ failure scores

      2015, Artificial Intelligence in Medicine
      Citation Excerpt :

      Additionally, the models are tested on retrospective data as if it were live patient data, taking into account patient data not only from the first five days, but from the previous five to make moving window predictions. For training the different machine learning models, repeated random sub-sampling validation (RRSSV) [26], in which the dataset is split n times in a training set (60%) and a validation set (40%), is used. Over these n splits, the average or median of the measured values (e.g., median offset or average recall) is computed.

    • Generating information for small data sets with a multi-modal distribution

      2014, Decision Support Systems
      Citation Excerpt :

      Companies can gain a competitive advantage by speedily providing new products, but when these are in the pilot run stage there is generally only a small amount of data that can be used to improve their performance, due to financial and time limitations. It is thus important to develop analysis methods for use with small data sets, in order to achieve better classification performance [19,20,23,25,28]. Many approaches have been proposed to deal with this issue, with, for example Das and Nenadic [8] and Xu et al. [33] creating algorithms for certain data sets.

    • Improving learning accuracy by using synthetic samples for small datasets with non-linear attribute dependency

      2014, Decision Support Systems
      Citation Excerpt :

      This paper sets the training data size NT forward in order as 20, 40, 60, 80, 100, 125, and 150 in the UCI datasets. The determination of virtual sample size is based on Li et al. [12], which stated that too many virtual samples would decrease the learning accuracy. In order to verify the improved predictive accuracy achieved with small dataset learning in this work, the related experiment is carried out by the following three steps:

    • Using structure-based data transformation method to improve prediction accuracies for small data sets

      2012, Decision Support Systems
      Citation Excerpt :

      If the ε-neighborhood of a data point contains other data which has a data size that is more than a certain pre-defined number (Minpts), a cluster with this data (called the core object) is created; otherwise, the data is treated as noise which will be eventually deleted. DBSCAN iteratively collects directly density-reachable data (data within the ε-neighborhood of a core object) until no new data can be added to any cluster, and this may involve merging some items [19]. In this study, we apply the DBSCAN algorithm to cluster overall data sets to detect the data structures and noise.
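The DBSCAN mechanics described in this excerpt (ε-neighborhood, Minpts, core objects, density-reachable expansion, noise) can be sketched minimally as follows; this is a simplified illustration, not the implementation used in the cited study, and the neighborhood here counts the point itself:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: a point whose eps-neighborhood holds at
    least min_pts points (a core object) seeds a cluster; directly
    density-reachable points are collected iteratively; points that
    end up in no cluster are labeled noise (-1)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps]

    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        labels[i] = cluster         # i is a core object: start a cluster
        queue = list(seeds)
        while queue:                # collect density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:
                queue.extend(more)  # j is also a core object; keep expanding
        cluster += 1
    return labels

# Two dense groups plus one isolated outlier, which becomes noise.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
assert dbscan(pts, eps=2, min_pts=2) == [0, 0, 0, 1, 1, 1, -1]
```

The merging behavior the excerpt mentions corresponds to the expansion loop absorbing a point that an earlier pass had provisionally marked as noise.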



Yao-Hwei Fang is a postdoctoral fellow in the Division of Biostatistics and Bioinformatics, National Health Research Institutes, where he works in a laboratory for the statistical analysis of human genetics. He received his Ph.D. at the Department of Industrial and Information Management at National Cheng Kung University, Taiwan, in 2009.

Y.M. Frank Fang obtained his Ph.D. degree from the Department of Civil and Hydraulic Engineering, Feng Chia University (FCU), in 2006. Before joining that department in 2006, he worked as a postdoctoral researcher in the Geographic Information Systems Research Center, Feng Chia University. Currently, Assistant Professor Fang is Chief Researcher of the Geographic Information Systems Research Center, FCU. His research interests include disaster monitoring and civil engineering.
