Neurocomputing

Volume 432, 7 April 2021, Pages 10-20

Learning performance of LapSVM based on Markov subsampling

https://doi.org/10.1016/j.neucom.2020.12.014

Abstract

It has become common to collect massive datasets in modern applications. Massive and highly noise-contaminated data pose serious challenges to conventional semi-supervised learning methods. To tackle the challenges arising from this large-quantity-low-quality situation, we propose a distribution-free Markov subsampling strategy based on the Laplacian support vector machine (LapSVM) to achieve robust and effective estimation. The core idea is to construct an informative subset which allows us to conservatively correct a rough initial estimate towards the true classifier. Specifically, the proposed subsampling strategy selects samples with small losses via a probabilistic procedure, constructing a subset which stands a good chance of excluding noisy data and providing a safe improvement over the rough initial estimate. Theoretically, we show that the obtained classifier is statistically consistent and can achieve a fast learning rate under mild conditions. The promising performance is also supported by simulation studies and real data examples.

Introduction

Classification is one of the fundamental tasks in statistics and machine learning; it aims to identify the category of a new observation based on the knowledge from a training dataset containing observations with known category membership (labels). In modern scientific research, it is increasingly common that analysts need to deal with a huge amount of training data, where the category membership may be labeled for only a small portion of the observations. For example, customs officers need to flag illegitimate shipments from hundreds of thousands of declarations in a day; due to limited resources, they may only check and label a few hundred of them. Apparently, simply deleting all the unlabeled data from the training set may lead to a significant loss of information. How to construct a good classifier utilizing both labeled and unlabeled data has therefore been an important research problem.

In the literature, several strategies have been proposed for learning with unlabeled data. In particular, as a benchmark semi-supervised learning method, the Laplacian Support Vector Machine (LapSVM) has attracted considerable attention in recent years; see, for example, [1], [2], [3], [4]. It incorporates the manifold information of unlabeled data into classification using a similarity measure between observations. LapSVM has been demonstrated to be effective and efficient in applications where the size of the training set is moderate. When the amount of training data is huge, LapSVM becomes less effective or even infeasible, as solving the associated regularization problem is numerically costly. Moreover, in many applications, a big training set often comes with a complex structure and is highly noise-contaminated, as the data are not collected from a designed experiment. When the noise level is high, incorporating all the unlabeled data into the training process might deteriorate the predictive performance of the classifier due to overfitting.
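For reference, the LapSVM classifier is typically obtained by solving a manifold-regularized hinge-loss problem of the following standard form (as in the manifold regularization framework of Belkin et al.; the scaling of the graph term shown here is one common convention and may differ slightly from the exact formulation used in Section 2):

$$f_{D} \;=\; \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{n_1} \sum_{i=1}^{n_1} \max\bigl(0,\, 1 - y_i f(x_i)\bigr) \;+\; \lambda \|f\|_K^2 \;+\; \frac{\gamma}{(n_1+n_2)^2}\, \mathbf{f}^{\top} L\, \mathbf{f},$$

where $\mathbf{f} = (f(x_1), \ldots, f(x_{n_1+n_2}))^{\top}$, $L$ is the graph Laplacian of a similarity graph built over both labeled and unlabeled observations, and $\lambda, \gamma$ are regularization parameters. The graph term pulls the decision function towards smoothness along the data manifold estimated from all $n_1 + n_2$ points, which is how the unlabeled data enter the training.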

To tackle the challenges of such a “large-quantity-low-quality” situation, one natural idea is to first select an informative subset from the original data [5], [6], [7], and then train a classifier based on this subset. For computational convenience and effective training, such an informative subset should include only a moderate number of observations while capturing the essential classification information contained in the original data. Apparently, finding such a good subset is not straightforward, especially in our problem setup, where the majority of observations are unlabeled and their impact on finding the classifier is unknown beforehand.

In this paper, we propose a distribution-free Markov subsampling scheme to implement the idea of subset learning discussed above. Suppose that a training set contains n1 labeled observations and n2 unlabeled observations, with the total number of observations n = n1 + n2 being huge and n1 ≪ n2. The proposed scheme first finds a pilot function f0 based on all the labeled observations and a small number of randomly selected unlabeled observations. The function f0 is then used to impute a plausible class label for each unlabeled observation. Starting from a random observation in the original data, the scheme recursively evaluates the ratio of classification risk between the current observation and a candidate observation, and accepts or rejects the candidate with a probability proportional to its contribution towards finding a better classifier. The procedure stops when m observations have been selected after a short burn-in period. The proposed scheme thus generates a set of uniformly ergodic Markov chain (u.e.M.c) observations. We will show that the m observations so obtained constitute a manageable informative subset DM, which is a refined representative of the original big training set and can be used to train an ideal classifier more efficiently.
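As a rough illustration of this acceptance-rejection idea, the following Python sketch walks over the data and favours candidates with small hinge loss under the pilot classifier. It is a minimal sketch of the general strategy, not the paper's Algorithm 1; the loss-ratio acceptance rule, the burn-in handling, and all function names here are assumptions.

    import numpy as np

    def hinge_loss(y, score):
        # Hinge loss of an observed or imputed label y at decision value score.
        return max(0.0, 1.0 - y * score)

    def markov_subsample(X, y_imputed, f0, m, burn_in=100, rng=None):
        """Illustrative Markov subsampling: accept a candidate with probability
        min(1, loss(current) / loss(candidate)), so that small-loss (plausibly
        clean) observations are visited more often by the chain.

        X          : (n, d) array of covariates
        y_imputed  : (n,) labels, observed or imputed by the pilot f0
        f0         : pilot decision function, f0(x) -> real-valued score
        m          : number of chain states to keep after burn-in
        """
        rng = np.random.default_rng() if rng is None else rng
        n = X.shape[0]
        current = rng.integers(n)          # start from a random observation
        eps = 1e-12                        # guard against zero losses
        selected, steps = [], 0
        while len(selected) < m:
            candidate = rng.integers(n)
            l_cur = hinge_loss(y_imputed[current], f0(X[current])) + eps
            l_cand = hinge_loss(y_imputed[candidate], f0(X[candidate])) + eps
            if rng.random() < min(1.0, l_cur / l_cand):
                current = candidate        # small-loss candidates are favoured
            steps += 1
            if steps > burn_in:
                selected.append(current)   # record the chain state
        return np.asarray(selected)

The returned indices define the subset on which LapSVM would then be retrained; whether repeated indices are kept or deduplicated is a design choice left open in this sketch.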

Compared with many other Markov sampling methods in the literature (see, e.g., [8], [9], [10]), the proposed scheme does not require any prior information on the distribution of the data. Thus, it is particularly suited for learning from complex data. Under mild conditions, we will show that the LapSVM based on the selected DM is statistically consistent, with an optimal $O_p(m^{-1})$ convergence rate to the oracle classifier. The promising performance of the proposed method is supported by both simulations and real data applications.

The rest of the paper is organized as follows. The problem is set up in Section 2. Section 3 states the proposed Markov subsampling scheme. Section 4 presents the generalization analysis results on the LapSVM with u.e.M.c observations. Section 5 then demonstrates the experimental evaluation results for the proposed Markov subsampling strategy. Finally, Section 6 summarizes the paper with some useful remarks.

Section snippets

LapSVM algorithm

Let $Y \in \{-1,1\}$ be a binary response and $X \in \mathcal{X} \subset \mathbb{R}^d$ be a $d$-dimensional covariate drawn from a compact set $\mathcal{X}$. Suppose that $Z = (X, Y)$ follows a fixed but unknown distribution $\rho$ whose support is the whole of $\mathcal{X} \times \{-1,1\}$. Let $(x,y)$ be an observed value of $(X,Y)$. We have $\rho(x,y) = \rho_X(x)\,\rho(y|x)$, where $\rho_X(x)$ is the marginal probability of $X = x$ and $\rho(y|x)$ is the conditional probability of $Y = y$ given $X = x$.

A classifier $g: \mathcal{X} \to \{-1,1\}$ is a mapping that predicts the class label $y$ for each possible value of $X$. The accuracy of
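For reference, the quantities this section goes on to use are presumably the standard ones: the misclassification risk of a classifier $g$ and the oracle (Bayes) classifier $g_c$ invoked in Section 4,

$$R(g) = \mathbb{P}\bigl(g(X) \neq Y\bigr), \qquad g_c(x) = \operatorname{sgn}\bigl(\rho(y=1 \mid x) - 1/2\bigr),$$

so that $R(g_c) \le R(g)$ holds for every measurable classifier $g$.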

A Markov subsampling scheme with LapSVM

Making good use of unlabeled data is quite challenging in the big data context. We observe that not all unlabeled observations play an equally important role in the learning process. Hence it is natural to select a representative subset from the original dataset.

Intuitively, Markov Chain Monte Carlo (MCMC) [8], [10], [14] is an applicable strategy to achieve this goal. Nevertheless, the main issue is how to appropriately specify the involved acceptance probabilities. In particular

Learning rate of LapSVM with u.e.M.c subsample

In this section, we theoretically assess the generalization performance of LapSVM when a u.e.M.c subsample is used. Our goal is to provide an upper bound on the excess misclassification error $R(\mathrm{sgn}(f_{D_M})) - R(g_c)$, where $g_c$ is the oracle classifier defined in (2) and $f_{D_M}$ is the estimator generated by LapSVM on $D_M$.

For any $f \in \mathcal{H}_K$, we define its expected risk with the hinge loss $\phi$ by $\mathcal{E}(f) = \int_{\mathcal{X} \times \{-1,+1\}} \phi\bigl(y f(x)\bigr)\, d\rho(x,y)$, and define the corresponding empirical risk by $\mathcal{E}_z(f) = \frac{1}{n_1} \sum_{i \in I_1} \phi\bigl(y_i f(x_i)\bigr)$.
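As a concrete reading of the empirical risk (the function names below are illustrative, not from the paper), $\mathcal{E}_z(f)$ is simply the average hinge loss of the decision function over the labeled observations:

    import numpy as np

    def empirical_hinge_risk(f, X_labeled, y_labeled):
        # Average hinge loss of the decision function f over the n1 labeled points.
        y = np.asarray(y_labeled, dtype=float)
        scores = np.array([f(x) for x in X_labeled])
        return float(np.mean(np.maximum(0.0, 1.0 - y * scores)))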

It has been

Numerical studies

This section evaluates the performance of LapSVM with the proposed Markov subsampling scheme. The evaluation is conducted by comparing the LapSVM with u.e.M.c observations subsampled by Algorithm 1 in Section 3, denoted by LapSVM-MS, against the LapSVM with the original i.i.d. observations, denoted by LapSVM-IID, on both simulations and real applications. In all experiments, the Gaussian kernel $K(x,x') = \exp\{-\|x-x'\|^2/(2\sigma^2)\}$ is adopted, where the width parameter $\sigma$ and regularization parameters $\lambda, \gamma$ are
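For concreteness, the Gaussian kernel stated above can be evaluated in vectorized form as below; only the formula $K(x,x') = \exp\{-\|x-x'\|^2/(2\sigma^2)\}$ comes from the text, while the function name and interface are ours:

    import numpy as np

    def gaussian_kernel_matrix(X1, X2, sigma):
        # Gram matrix with entries K[i, j] = exp(-||X1[i] - X2[j]||^2 / (2 * sigma^2)).
        sq_dists = (
            np.sum(X1 ** 2, axis=1)[:, None]
            + np.sum(X2 ** 2, axis=1)[None, :]
            - 2.0 * X1 @ X2.T
        )
        return np.exp(-sq_dists / (2.0 * sigma ** 2))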

Conclusion

In this paper, we propose a Markov subsampling strategy based on LapSVM to deal with the “large-quantity-low-quality” situation in big data. We analyze the generalization performance of the proposed subsampling method. The theoretical results show that the LapSVM estimator based on Markov subsampling is statistically consistent and can achieve a fast learning rate under mild conditions. Experiments on simulation studies and real data examples demonstrate the effectiveness of the proposed method.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the NSERC grant RGPIN-2016-05024 and in part by the National Natural Science Foundation of China (NSFC) under Grants 11690014, 11671161, and 11971373. The content is solely the responsibility of the authors and does not necessarily represent the official views of the aforementioned funding agencies.

References (30)

  • Y. Cao et al.

    Consistency of regularized spectral clustering

    Appl. Comput. Harmon. Anal.

    (2011)
  • D. Zhou

    The covering number in learning theory

    J. Complexity

    (2002)
  • M. Belkin et al.

    Semi-supervised learning on Riemannian manifolds

    Mach. Learn.

    (2004)
  • M. Belkin et al.

    Regularization and semi-supervised learning on large graphs

  • S. Melacci et al.

    Laplacian support vector machines trained in the primal

    J. Mach. Learn. Res.

    (2011)
  • P. Niyogi

    Manifold regularization and semi-supervised learning: some theoretical analyses

    J. Mach. Learn. Res.

    (2013)
  • H. Avron et al.

    Faster subset selection for matrices and applications

    SIAM J. Matrix Anal. Appl.

    (2013)
  • M. Dereziński et al.

    Reverse iterative volume sampling for linear regression

    J. Mach. Learn. Res.

    (2018)
  • D. Ting et al.

    Optimal subsampling with influence functions

  • C. Andrieu et al.

    An introduction to MCMC for machine learning

    Mach. Learn.

    (2003)
  • F. Liang et al.

    An imputation–regularized optimization algorithm for high dimensional missing data problems and beyond

    J. R. Stat. Soc. Ser. B (Stat. Methodol.)

    (2018)
  • J.E. Johndrow et al.

    MCMC for imbalanced categorical data

    J. Am. Stat. Assoc.

    (2019)
  • M. Belkin et al.

    Manifold regularization: a geometric framework for learning from labeled and unlabeled examples

    J. Mach. Learn. Res.

    (2006)
  • D. Down et al.

    Exponential and uniform ergodicity of Markov processes

    Ann. Probab.

    (1995)
  • S.P. Meyn et al.

    Markov Chains and Stochastic Stability

    (2012)

Tieliang Gong received his PhD degree from Xi’an Jiaotong University, Xi’an, China, in 2018. From September 2018 to August 2020, he was a Post-Doctoral Researcher with the Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada. He is currently with the School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an. His research interests include statistical learning theory, machine learning and high-dimensional statistical inference.

    Hong Chen received the B.S., M.S., and Ph.D. degrees from Hubei University, Wuhan, China, in 2003, 2006, and 2009, respectively. From February 2016 to August 2017, he was a Post-Doctoral Researcher with the Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX, USA. He is currently a Professor with the Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan. His current research interests include machine learning, statistical learning theory, and approximation theory.

Chen Xu received the PhD degree in statistics from the University of British Columbia, Canada. He is an associate professor in statistics at the University of Ottawa, Canada. His research interests include high-dimensional data, big data, kernel methods, and statistical computing. His work has been funded by the Natural Sciences and Engineering Research Council of Canada. He served as guest editor for Neurocomputing (2015–2016).
