Information Sciences, Volume 477, March 2019, Pages 246-264

Block-regularized repeated learning-testing for estimating generalization error

https://doi.org/10.1016/j.ins.2018.10.040

Abstract

Repeated learning-testing (RLT) is a popular cross-validation method that is commonly adopted for estimating and comparing algorithm performance in machine learning. However, the variance of the RLT estimator of the generalization error is easily affected by random partitioning. Poor data partitioning may cause a large overlap between any two training sets of RLT and enlarge the variance of the RLT estimator of the generalization error. Thus, in this study, we constrain the numbers of overlapping samples between any two training sets and construct a novel data partitioning schema for RLT, called block-regularized RLT (BRLT). We theoretically prove that the BRLT estimator has a smaller variance than the RLT estimator. Specifically, the variance of the RLT estimator reaches its minimum when all samples in a data set occur equally often across the training sets and all pairwise numbers of overlapping samples are equal. Furthermore, we provide easy-to-use construction algorithms for BRLT's partition set under several settings of training set size and partition count by adopting two-level orthogonal arrays. We also illustrate BRLT's optimal properties with several simulated and real-world examples.

Introduction

Repeated learning-testing (RLT) randomly splits a data set several times to construct multiple training-validation partitions and obtains a performance estimate of a machine learning model for statistical inference by averaging over the partitions [6], [8]. In this process, each partition of the data set corresponds to a hold-out validation. Thus, the RLT method is also known as repeated hold-out [15] or random cross-validation [1].
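For concreteness, the following is a minimal Python sketch (not taken from the paper) of the repeated hold-out scheme just described: J random training-validation splits of a fixed size, one hold-out error per split, and the average as the RLT estimate. The toy predictor (the training-set mean under squared loss) and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def rlt_estimate(y, n1, J, rng=None):
    """Repeated learning-testing (repeated hold-out) estimate of the
    generalization error of a toy predictor: the training-set mean
    under squared loss. The predictor and the parameter values are
    illustrative assumptions, not the paper's setup."""
    rng = np.random.default_rng(rng)
    n = len(y)
    holdout_errors = []
    for _ in range(J):                      # J independent random splits
        perm = rng.permutation(n)
        train, test = perm[:n1], perm[n1:]  # training set of size n1
        prediction = y[train].mean()        # "train" the toy model
        holdout_errors.append(np.mean((y[test] - prediction) ** 2))
    return np.mean(holdout_errors)          # average over the partitions

# usage: 100 samples, training sets of size 50, J = 20 repetitions
y = np.random.default_rng(0).normal(size=100)
print(rlt_estimate(y, n1=50, J=20, rng=1))
```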

In machine learning, RLT is usually adopted to estimate the generalization error of an algorithm, and an RLT estimator is commonly used in model selection, variable screening and comparison of algorithm performance [14], [15], [25]. On a small-scale data set, RLT sufficiently reuses all samples by partitioning the data set multiple times. Thus, the RLT estimator of the generalization error achieves high accuracy and is a better alternative to the resubstitution [13] and hold-out [4] estimators. Moreover, the advantage of RLT is that its partition count and training set size can be flexibly adjusted compared with the immensely popular standard K-fold cross-validation (KFCV), whose training set size must be a specific proportion (i.e., (K-1)/K) of the entire data set, where K is the partition count. The constrained partition count of KFCV might lead to a large internal variance in the KFCV estimator of the generalization error [11], [15], especially on small-sample classification problems [5]. This internal variance can be reduced by adopting RLT with sufficiently many partitions for a fixed training set size. A typical example is text data sets, on which the KFCV estimators of model performance still possess large variances [9], [19]. Several data replications based on a fixed training set size should be added to obtain a stable estimator. Then, an RLT method with multiple data partitions can be used.

However, the performance of the RLT method, that is, the accuracy (variance) of the RLT estimator of the generalization error, frequently relies on the quality of the data partitioning. In RLT, all partitions of a data set are randomly and independently generated, so the corresponding training sets from any two partitions contain common (overlapping) samples. Excessive numbers of overlapping samples introduce unnecessary correlations into the RLT estimator and enlarge its variance. Ref. [33] illustrated this case through several simulations: as the number of overlapping samples between any two training sets increases, the covariance of the two corresponding hold-out estimators increases and the variance of the RLT estimator grows. This phenomenon is formally analyzed in the following sections of this paper by adopting the number of overlapping samples between training sets as a measure of the quality of a partition set.

The effectiveness of using the number of overlapping samples as a measure was confirmed for m × 2 cross-validation and motivated the block-regularized m × 2 cross-validation [33], [34]. For RLT, the number of overlapping samples is still an effective measure. However, unlike m × 2 cross-validation, the number of times each sample occurs across the training sets of RLT is random. These random occurrence counts affect the total sum of the numbers of overlapping samples (see Section 3 for the formal analysis) and cause fluctuations in the variance of the RLT estimator. Hence, both the numbers of overlapping samples and the occurrence counts of samples in the training sets should be balanced to obtain a good partition set for RLT.
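As a hedged illustration of the two quantities that the preceding paragraph argues must be balanced, the short Python sketch below computes, for an arbitrary partition set, the pairwise overlap counts between training sets and the occurrence count of each sample across the training sets; the function name and the toy example are assumptions, not part of the paper.

```python
import numpy as np
from itertools import combinations

def partition_diagnostics(train_sets, n):
    """For a list of training index sets, return (i) the pairwise overlap
    counts and (ii) how often each of the n samples occurs across all
    training sets. These are the two quantities the text argues should be
    balanced; the function itself is only illustrative."""
    sets = [set(t) for t in train_sets]
    overlaps = [len(a & b) for a, b in combinations(sets, 2)]
    occurrences = np.zeros(n, dtype=int)
    for t in sets:
        occurrences[list(t)] += 1
    return overlaps, occurrences

# usage: J = 4 random training sets of size 5 drawn from n = 10 samples
rng = np.random.default_rng(0)
train_sets = [rng.permutation(10)[:5] for _ in range(4)]
overlaps, occ = partition_diagnostics(train_sets, n=10)
print(overlaps)   # pairwise overlaps vary with the random partitions
print(occ)        # unbalanced occurrence counts for a plain RLT
```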

Improving the variance of RLT estimators requires a rigorous theoretical investigation of that variance. This investigation has been performed in many studies [1], [8], [23], [25], [28], and several improvements have also been made [1]. Burman provided an asymptotic expansion of the variance of RLT estimators and showed that the variance of an RLT estimator decreases as the number of repetitions increases [8]. Nadeau and Bengio investigated the correlation structure in the variance of RLT estimators [25]. They provided a theoretical decomposition of the variance of RLT estimators and illustrated that covariances among the hold-out estimators dominate the variance of an RLT estimator, especially when the number of repetitions tends to infinity. Interestingly, they revealed that the variance does not increase with an increase in test set size and indicated that no universal unbiased estimator of the variance of an RLT estimator exists. Markatou et al. [23] conducted a theoretical analysis of the variance of RLT estimators in terms of moments. Their work provides a theoretical basis for the method proposed in the present study. They derived the distribution of the number of overlapping samples between any two training sets in an RLT and theoretically elaborated the relationship between overlapping samples and the variance of RLT estimators for a general family of loss functions. Moreover, Rodríguez et al. [28] decomposed the variance of RLT into two parts in terms of sensitivity, namely, internal and external sensitivities. The two parts correspond to the variances incurred by changes in partitions and data sets, respectively. They numerically showed that external sensitivity is an important part of the variance of RLT estimators. Recently, Afendras and Markatou [1] made several improvements to the RLT method. They answered two important questions in constructing RLT estimators, namely, how to select the optimal repetition count and training set size to derive a stable RLT estimator for a fixed data set size. For a broad class of loss functions, they proposed several useful selection criteria, such as pi-effectiveness and r-reduction, and developed a wealth of theoretical results and an efficient algorithm to determine the optimal repetition count and training set size. An interesting conclusion of their work was that the optimal training set size should be half the data set size. This conclusion provides evidence for using half-sampling methods, such as 5 × 2 cross-validation [10] and block-regularized m × 2 cross-validation [33], in model estimation and selection, and strengthens the importance of the block-regularized repeated half-sampling (BRHS) proposed in the current work (Section 3.2).

In this study, we construct an optimal design of the partition set for RLT to reduce the variance of RLT estimators under a given sample size, partition count and training set size. The optimal design requires a reasonable allocation of samples in the partition set of RLT. This allocation prevents excessive overlapping samples, which would otherwise enlarge the variance of the RLT estimator. To achieve our aim, we present a novel version of RLT, named block-regularized RLT (BRLT), that constrains the number of overlapping samples. We prove that the BRLT estimator of the generalization error has a smaller variance than the RLT estimator. Specifically, we theoretically show that the variance of the RLT estimator reaches its minimum when its partition set satisfies two constraints: (1) each sample in the data set occurs the same number of times across all training sets, and (2) all pairwise numbers of overlapping samples are equal. Therefore, BRLT can be regarded as an ideal version of cross-validation that is more flexible than KFCV. We provide several easy-to-use construction algorithms for the partition set of BRLT in practical situations. Overall, this work offers the following major contributions.

  • For any two hold-out estimators in an RLT estimator, we prove that their covariance is a convex function with regard to the number of overlapping samples between the two corresponding training sets.

  • We formulate the variance minimization problem of an RLT estimator with a constraint on the number of overlapping samples, and we propose the BRLT method, in which the estimator of the generalization error has a smaller variance than does the RLT estimator.

  • We develop three efficient construction methods of BRLT for three different settings of training set size and partition count. Specifically, we provide an easy-to-use construction for block-regularized repeated half-sampling (BRHS) that is based on two-level orthogonal arrays (OAs) and theoretically prove its minimal variance property. (A simplified orthogonal-array sketch follows this list.)
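To make the third contribution more tangible (see the note in the bullet above), the following Python sketch shows one way a two-level orthogonal array can drive a half-sampling partition set: the non-constant columns of a Hadamard matrix mark which blocks of the data enter each training set, and each column together with its complement yields a pair of half-samples. The Hadamard-derived array, the equal-block layout and the function names are assumptions for illustration; the authors' actual BRHS construction and its minimal-variance proof are given in Section 3.2.

```python
import numpy as np

def sylvester_hadamard(k):
    """Sylvester Hadamard matrix of order 2**k (entries +/-1)."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

def half_sampling_partitions(n, k=3):
    """Illustrative half-sampling partition set driven by a two-level
    orthogonal array (the non-constant columns of a Hadamard matrix).
    The n samples are cut into 2**k equal blocks (n is assumed to be
    divisible by 2**k); for each OA column, the blocks marked +1 form
    one training set and the blocks marked -1 form its complement."""
    H = sylvester_hadamard(k)
    oa = H[:, 1:]                          # 2**k rows, 2**k - 1 balanced columns
    blocks = np.array_split(np.arange(n), 2 ** k)
    train_sets = []
    for j in range(oa.shape[1]):
        for level in (+1, -1):             # a half-sample and its complement
            idx = [blocks[b] for b in range(2 ** k) if oa[b, j] == level]
            train_sets.append(np.sort(np.concatenate(idx)))
    return train_sets

# usage: n = 80 samples, 8 blocks of 10, J = 14 half-sampling training sets
ts = half_sampling_partitions(80, k=3)
print([len(t) for t in ts])                          # all 40 = n/2
print(len(set(ts[0]) & set(ts[2])))                  # 20 = n/4 overlap
print(np.bincount(np.concatenate(ts), minlength=80)) # each sample used 7 times
```

In this simplified construction every sample occurs in exactly 2^k - 1 of the 2(2^k - 1) training sets, and any two non-complementary training sets share exactly n/4 samples, which mirrors the balance conditions stated in the abstract.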

The rest of this paper is organized as follows. Section 2 provides the notations and preliminaries, summarizes the necessary conclusions of variance analysis on RLT estimators from previous studies [1], [23], [25], and illustrates the convex property of the covariance function. Section 3 presents the novel BRLT method and theoretically illustrates that the BRLT estimator of generalization error possesses a smaller variance than does the RLT estimator. Furthermore, we provide several construction algorithms for the partition set of BRLT. Section 4 elaborates the experimental data sets and settings and Section 5 presents the experimental results and analysis. Section 6 provides the conclusion.

Section snippets

Notations and preliminaries

Assume that data set $D_n=\{z_i : z_i=(x_i,y_i),\ i=1,\dots,n\}$ consists of $n$ i.i.d. samples drawn from an unknown distribution $P$, where $x_i \in \mathbb{R}^p$ is a predictor variable vector that predicts the response variable $y_i \in \mathbb{R}$. Given a machine learning algorithm $A$, the loss function $L(A(D_n),z)$ measures the error incurred by the decision that algorithm $A$ trained on $D_n$ makes on a new sample $z$ that is independently drawn from $P$. Extensively used loss functions include the squared loss function, absolute loss function for regression,

BRLT

Intuitively, minimizing Eq. (7), that is, $\mathrm{Var}(\hat{\mu}_{\mathrm{RLT}}(S))$, requires all $\phi_{jj'}$s in $\Phi$ to simultaneously reach their minimum. However, the requirement is difficult to satisfy because the $\phi_{jj'}$s are correlated. Several other $\phi_{jj'}$s might be far away from the minimum when one of the $\phi_{jj'}$s reaches its minimum ($n_1-n_2$). All $\phi_{jj'}$s will ideally be $\binom{Jn_1/n}{2}\,n/\binom{J}{2}$ when they are simultaneously minimized (proof of Theorem 1). Although the $\phi_{jj'}$s cannot reach the ideal value in several situations, they should not be far away
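The snippet below merely evaluates this ideal common value of the overlaps, as reconstructed from the formula above, and contrasts it with the average overlap of a plain random RLT partition set; the parameter values are illustrative assumptions.

```python
import numpy as np
from itertools import combinations
from math import comb

def ideal_overlap(n, n1, J):
    """Ideal common value of the pairwise overlaps phi_{jj'} when every
    sample occurs in exactly J*n1/n training sets (reconstructed from the
    text; J*n1/n is assumed to be an integer here)."""
    c = J * n1 // n
    return n * comb(c, 2) / comb(J, 2)

# average overlap of a plain random RLT partition set vs the ideal value
rng = np.random.default_rng(0)
n, n1, J = 40, 20, 8
train_sets = [set(rng.permutation(n)[:n1]) for _ in range(J)]
avg = np.mean([len(a & b) for a, b in combinations(train_sets, 2)])
# the random average fluctuates around n1*n1/n = 10, while the
# balanced ideal value is n*comb(4,2)/comb(8,2) = 8.57
print(avg, ideal_overlap(n, n1, J))
```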

Experimental data sets and settings

We mainly illustrate the optimal properties of BRLT in regression and classification tasks in which several simulated and real-world data sets are considered, respectively. Multiple classical machine learning algorithms are used on these data sets. The settings of the simulation data sets and algorithms are as follows.

  • Simulated regression data set (SREG). In data set $D_n=\{(x_i,y_i)\}_{i=1}^{n}$, predictor vector $x_i$ contains $p$ predictors, that is, $x_i=(x_{i,1},\dots,x_{i,p}) \sim N(0_p, \Sigma_{p\times p})$, where $0_p$ is a zero vector with $p$
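A minimal sketch of such a simulated regression data set is given below. Because the snippet above is truncated, the covariance matrix Σ (an AR(1)-type structure), the linear response model and the noise level are assumptions made only for illustration, not the paper's actual SREG settings.

```python
import numpy as np

def simulate_sreg(n=200, p=10, rho=0.5, sigma=1.0, rng=None):
    """Sketch of an SREG-style simulated regression data set: predictors
    drawn from N(0_p, Sigma) as stated in the text. The covariance
    Sigma_{kl} = rho**|k-l|, the linear response and the noise level are
    assumptions, since the snippet omits these details."""
    rng = np.random.default_rng(rng)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.ones(p)                          # assumed coefficient vector
    y = X @ beta + sigma * rng.normal(size=n)  # assumed linear model + noise
    return X, y

X, y = simulate_sreg(rng=0)
print(X.shape, y.shape)   # (200, 10) (200,)
```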

Results and analysis

In this section, we show the optimal properties of BRLT with regard to the following aspects via simulated experiments.

Aspect 1: Is covariance function f(x) a convex and monotonically increasing function w.r.t. the number of overlapping samples ϕ=x?

Aspect 2: How large is the difference in variance between the RLT estimator with $n_1=(K-1)n/K$ and the standard KFCV estimator of the generalization error?

Aspect 3: How large are the differences of variances between the estimators of the

Conclusion

We developed a new data partitioning schema for RLT called BRLT. In BRLT, the numbers of overlapping samples between any two training sets are constrained, and the corresponding estimator has a smaller variance than does the RLT estimator. We provided construction algorithms for the partition sets of BRLT and validated their optimal properties with simulated and real-world data sets.

In the future, we plan to develop novel estimators of the variance of the BRLT estimator and apply BRLT and these

Acknowledgments

This work was supported by the National Social Science Fund of China (NSSFC-16BTJ034). Experiments were supported by the High Performance Computing System of Shanxi University.

References (36)

  • U.M. Braga-Neto et al., Is cross-validation valid for small-sample microarray classification?, Bioinformatics (2004)

  • L. Breiman et al., Classification and Regression Trees (1984)

  • P. Burman, A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods, Biometrika (1989)

  • W. Daelemans et al., Evaluation of machine learning methods for natural language processing tasks, 3rd International Conference on Language Resources and Evaluation (LREC 2002)

  • T.G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Comput. (1998)

  • B. Efron et al., Improvements on cross-validation: the 632+ bootstrap method, J. Am. Stat. Assoc. (1997)

  • J. Fan et al., Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. (2008)

  • O. Gascuel et al., Distribution-free performance bounds with the resubstitution error estimate, Pattern Recognit. Lett. (1992)