Neurocomputing

Volume 432, 7 April 2021, Pages 10-20

Learning performance of LapSVM based on Markov subsampling

https://doi.org/10.1016/j.neucom.2020.12.014

Abstract

It has become common to collect massive datasets in modern applications. Massive and highly noise-contaminated data pose serious challenges to conventional semi-supervised learning methods. To tackle the challenges arising from this large-quantity-low-quality situation, we propose a distribution-free Markov subsampling strategy based on the Laplacian support vector machine (LapSVM) to achieve robust and effective estimation. The core idea is to construct an informative subset which allows us to conservatively correct a rough initial estimate towards the true classifier. Specifically, the proposed subsampling strategy selects samples with small losses via a probabilistic procedure, constructing a subset which stands a good chance of excluding noisy data and providing a safe improvement over the rough initial estimate. Theoretically, we show that the obtained classifier is statistically consistent and can achieve a fast learning rate under mild conditions. The promising performance is also supported by simulation studies and real data examples.

Introduction

Classification is one of the fundamental tasks in statistics and machine learning; it aims to identify the category of a new observation based on the knowledge from a training dataset containing observations with known category membership (labels). In modern scientific research, it is increasingly common that analysts need to deal with a huge amount of training data, where the category membership may be labeled for only a small portion of the observations. For example, customs officers need to flag illegitimate shipments from hundreds of thousands of declarations in a day; due to limited resources, they may only check and label a few hundred of them. Apparently, simply deleting all the unlabeled data from the training set may lead to a significant loss of information. How to construct a good classifier utilizing both labeled and unlabeled data has therefore been an important research problem.

In the literature, several strategies have been proposed for learning with unlabeled data. In particular, as a benchmark semi-supervised learning method, the Laplacian Support Vector Machine (LapSVM) has attracted considerable attention in recent years; see, for example, [1], [2], [3], [4]. It incorporates the manifold information of unlabeled data into classification using a similarity measure between observations. LapSVM has been demonstrated to be effective and efficient in applications where the size of the training set is moderate. When the amount of training data is huge, LapSVM becomes less effective or even infeasible, as solving the associated regularization problem is numerically costly. Moreover, in many applications, a big training set often comes with a complex structure and is highly noise-contaminated, as the data are not collected from a designed experiment. When the noise level is high, incorporating all the unlabeled data into the training process might deteriorate the predictive performance of the classifier due to overfitting.
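For reference, the LapSVM classifier is typically obtained by solving a manifold-regularized hinge-loss problem of the following standard form (as in the manifold regularization framework of Belkin et al.; the scaling of the graph term shown here is one common convention and may differ slightly from the exact formulation used in Section 2):

$$f_{D} \;=\; \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{n_1} \sum_{i=1}^{n_1} \max\bigl(0,\, 1 - y_i f(x_i)\bigr) \;+\; \lambda \|f\|_K^2 \;+\; \frac{\gamma}{(n_1+n_2)^2}\, \mathbf{f}^{\top} L\, \mathbf{f},$$

where $\mathbf{f} = (f(x_1), \ldots, f(x_{n_1+n_2}))^{\top}$, $L$ is the graph Laplacian of a similarity graph built over both labeled and unlabeled observations, and $\lambda, \gamma$ are regularization parameters. The graph term pulls the decision function towards smoothness along the data manifold estimated from all $n_1 + n_2$ points, which is how the unlabeled data enter the training.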

To tackle the challenges of such a “large-quantity-low-quality” situation, one natural idea is to first select an informative subset from the original data [5], [6], [7], and then train a classifier based on this subset. For computational convenience and effective training, such an informative subset should include only a moderate number of observations while capturing the essential classification information contained in the original data. Apparently, finding such a good subset is not straightforward, especially in our problem setup, where the majority of observations are unlabeled and their impact on finding the classifier is unknown beforehand.

In this paper, we propose a distribution-free Markov subsampling scheme to implement the idea of subset learning discussed above. Suppose that a training set contains n1 labeled observations and n2 unlabeled observations, with the total number of observations n = n1 + n2 being huge and n1 ≪ n2. The proposed scheme first finds a pilot function f0 based on all the labeled observations and a small number of randomly selected unlabeled observations. The function f0 is then used to impute a plausible class label for each unlabeled observation. Starting from a random observation in the original data, the scheme recursively evaluates the ratio of classification risk between the current observation and a candidate observation, and accepts or rejects the candidate with a probability proportional to its contribution towards finding a better classifier. The procedure stops when m observations have been selected after a short burn-in period. The proposed scheme thus generates a set of uniformly ergodic Markov chain (u.e.M.c) observations. We will show that the m observations so obtained constitute a manageable informative subset DM, which is a refined representative of the original big training set and can be used to train an ideal classifier more efficiently.
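As a rough illustration of this acceptance-rejection idea, the following Python sketch walks over the data and favours candidates with small hinge loss under the pilot classifier. It is a minimal sketch of the general strategy, not the paper's Algorithm 1; the loss-ratio acceptance rule, the burn-in handling, and all function names here are assumptions.

    import numpy as np

    def hinge_loss(y, score):
        # Hinge loss of an observed or imputed label y at decision value score.
        return max(0.0, 1.0 - y * score)

    def markov_subsample(X, y_imputed, f0, m, burn_in=100, rng=None):
        """Illustrative Markov subsampling: accept a candidate with probability
        min(1, loss(current) / loss(candidate)), so that small-loss (plausibly
        clean) observations are visited more often by the chain.

        X          : (n, d) array of covariates
        y_imputed  : (n,) labels, observed or imputed by the pilot f0
        f0         : pilot decision function, f0(x) -> real-valued score
        m          : number of chain states to keep after burn-in
        """
        rng = np.random.default_rng() if rng is None else rng
        n = X.shape[0]
        current = rng.integers(n)          # start from a random observation
        eps = 1e-12                        # guard against zero losses
        selected, steps = [], 0
        while len(selected) < m:
            candidate = rng.integers(n)
            l_cur = hinge_loss(y_imputed[current], f0(X[current])) + eps
            l_cand = hinge_loss(y_imputed[candidate], f0(X[candidate])) + eps
            if rng.random() < min(1.0, l_cur / l_cand):
                current = candidate        # small-loss candidates are favoured
            steps += 1
            if steps > burn_in:
                selected.append(current)   # record the chain state
        return np.asarray(selected)

The returned indices define the subset on which LapSVM would then be retrained; whether repeated indices are kept or deduplicated is a design choice left open in this sketch.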

Compared with many other Markov sampling methods in the literature (see, e.g., [8], [9], [10]), the proposed scheme does not require any prior information on the distribution of the data. Thus, it is particularly suited for learning from complex data. Under mild conditions, we will show that the LapSVM based on the selected DM is statistically consistent, with an optimal $O_p(m^{-1})$ convergence rate to the oracle classifier. The promising performance of the proposed method is supported by both simulations and real data applications.

The rest of the paper is organized as follows. The problem is set up in Section 2. Section 3 states the proposed Markov subsampling scheme. Section 4 presents the generalization analysis results on the LapSVM with u.e.M.c observations. Section 5 then demonstrates the experimental evaluation results for the proposed Markov subsampling strategy. Finally, Section 6 summarizes the paper with some useful remarks.

Section snippets

LapSVM algorithm

Let $Y \in \{-1,1\}$ be a binary response and $X \in \mathcal{X} \subset \mathbb{R}^d$ be a $d$-dimensional covariate drawn from a compact set $\mathcal{X}$. Suppose that $Z = (X, Y)$ follows a fixed but unknown distribution $\rho$ whose support is the whole of $\mathcal{X} \times \{-1,1\}$. Let $(x,y)$ be an observed value of $(X,Y)$. We have $\rho(x,y) = \rho_X(x)\,\rho(y|x)$, where $\rho_X(x)$ is the marginal probability of $X = x$ and $\rho(y|x)$ is the conditional probability of $Y = y$ given $X = x$.

A classifier $g: \mathcal{X} \to \{-1,1\}$ is a mapping that predicts the class label $y$ for each possible value of $X$. The accuracy of
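For reference, the quantities this section goes on to use are presumably the standard ones: the misclassification risk of a classifier $g$ and the oracle (Bayes) classifier $g_c$ invoked in Section 4,

$$R(g) = \mathbb{P}\bigl(g(X) \neq Y\bigr), \qquad g_c(x) = \operatorname{sgn}\bigl(\rho(y=1 \mid x) - 1/2\bigr),$$

so that $R(g_c) \le R(g)$ holds for every measurable classifier $g$.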

A Markov subsampling scheme with LapSVM

Making good use of unlabeled data is quite challenging in the big data context. We observe that not all unlabeled observations play an equally important role in the learning process. Hence it is natural to select a representative subset from the original dataset.

Intuitively, Markov Chain Monte Carlo (MCMC) [8], [10], [14] is an applicable strategy to achieve this goal. Nevertheless, the main issue is how to appropriately specify the involved acceptance probabilities. In particular

Learning rate of LapSVM with u.e.M.c subsample

In this section, we theoretically assess the generalization performance of LapSVM when a u.e.M.c subsample is used. Our goal is to provide an upper bound on the excess misclassification error $R(\mathrm{sgn}(f_{D_M})) - R(g_c)$, where $g_c$ is the oracle classifier defined in (2) and $f_{D_M}$ is the estimator generated by LapSVM on $D_M$.

For any $f \in \mathcal{H}_K$, we define its expected risk with the hinge loss $\phi$ by $\mathcal{E}(f) = \int_{\mathcal{X} \times \{-1,+1\}} \phi\bigl(y f(x)\bigr)\, d\rho(x,y)$, and define the corresponding empirical risk by $\mathcal{E}_z(f) = \frac{1}{n_1} \sum_{i \in I_1} \phi\bigl(y_i f(x_i)\bigr)$.
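As a concrete reading of the empirical risk (the function names below are illustrative, not from the paper), $\mathcal{E}_z(f)$ is simply the average hinge loss of the decision function over the labeled observations:

    import numpy as np

    def empirical_hinge_risk(f, X_labeled, y_labeled):
        # Average hinge loss of the decision function f over the n1 labeled points.
        y = np.asarray(y_labeled, dtype=float)
        scores = np.array([f(x) for x in X_labeled])
        return float(np.mean(np.maximum(0.0, 1.0 - y * scores)))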

It has been

Numerical studies

This section evaluates the performance of LapSVM with the proposed Markov subsampling scheme. The evaluation is conducted by comparing the LapSVM with u.e.M.c observations subsampled by Algorithm 1 in Section 3, denoted by LapSVM-MS, against the LapSVM with the original i.i.d. observations, denoted by LapSVM-IID, on both simulations and real applications. In all experiments, the Gaussian kernel $K(x,x') = \exp\{-\|x-x'\|^2/(2\sigma^2)\}$ is adopted, where the width parameter $\sigma$ and regularization parameters $\lambda, \gamma$ are
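For concreteness, the Gaussian kernel stated above can be evaluated in vectorized form as below; only the formula $K(x,x') = \exp\{-\|x-x'\|^2/(2\sigma^2)\}$ comes from the text, while the function name and interface are ours:

    import numpy as np

    def gaussian_kernel_matrix(X1, X2, sigma):
        # Gram matrix with entries K[i, j] = exp(-||X1[i] - X2[j]||^2 / (2 * sigma^2)).
        sq_dists = (
            np.sum(X1 ** 2, axis=1)[:, None]
            + np.sum(X2 ** 2, axis=1)[None, :]
            - 2.0 * X1 @ X2.T
        )
        return np.exp(-sq_dists / (2.0 * sigma ** 2))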

Conclusion

In this paper, we propose a Markov subsampling strategy based on LapSVM to deal with the “large-quantity-low-quality” situation in big data. We analyze the generalization performance of the proposed subsampling method. The theoretical results show that the LapSVM estimator based on Markov subsampling is statistically consistent and can achieve a fast learning rate under mild conditions. Experiments on simulation studies and real data examples demonstrate the effectiveness of the proposed method.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the NSERC grant RGPIN-2016-05024 and in part by the National Natural Science Foundation of China (NSFC) under Grants 11690014, 11671161, and 11971373. The content is solely the responsibility of the authors and does not necessarily represent the official views of the aforementioned funding agencies.

References (30)

  • Y. Cao et al.

    Consistency of regularized spectral clustering

    Appl. Comput. Harmon. Anal.

    (2011)
  • D. Zhou

    The covering number in learning theory

    J. Complexity

    (2002)
  • M. Belkin et al.

    Semi-supervised learning on Riemannian manifolds

    Mach. Learn.

    (2004)
  • M. Belkin et al.

    Regularization and semi-supervised learning on large graphs

  • S. Melacci et al.

    Laplacian support vector machines trained in the primal

    J. Mach. Learn. Res.

    (2011)
  • P. Niyogi

    Manifold regularization and semi-supervised learning: some theoretical analyses

    J. Mach. Learn. Res.

    (2013)
  • H. Avron et al.

    Faster subset selection for matrices and applications

    SIAM J. Matrix Anal. Appl.

    (2013)
  • M. Dereziński et al.

    Reverse iterative volume sampling for linear regression

    J. Mach. Learn. Res.

    (2018)
  • D. Ting et al.

    Optimal subsampling with influence functions

  • C. Andrieu et al.

    An introduction to MCMC for machine learning

    Mach. Learn.

    (2003)
  • F. Liang et al.

    An imputation–regularized optimization algorithm for high dimensional missing data problems and beyond

    J. R. Stat. Soc. Ser. B (Stat. Methodol.)

    (2018)
  • J.E. Johndrow et al.

    MCMC for imbalanced categorical data

    J. Am. Stat. Assoc.

    (2019)
  • M. Belkin et al.

    Manifold regularization: a geometric framework for learning from labeled and unlabeled examples

    J. Mach. Learn. Res.

    (2006)
  • D. Down et al.

    Exponential and uniform ergodicity of Markov processes

    Ann. Probab.

    (1995)
  • S.P. Meyn et al.

    Markov Chains and Stochastic Stability

    (2012)

Tieliang Gong received his PhD degree from Xi’an Jiaotong University, Xi’an, China, in 2018. From September 2018 to August 2020, he was a Post-Doctoral Researcher with the Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada. He is currently with the School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an. His research interests include statistical learning theory, machine learning and high-dimensional statistical inference.

    Hong Chen received the B.S., M.S., and Ph.D. degrees from Hubei University, Wuhan, China, in 2003, 2006, and 2009, respectively. From February 2016 to August 2017, he was a Post-Doctoral Researcher with the Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX, USA. He is currently a Professor with the Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan. His current research interests include machine learning, statistical learning theory, and approximation theory.

Chen Xu received the PhD degree in statistics from the University of British Columbia, Canada. He is an associate professor in statistics at the University of Ottawa, Canada. His research interests include high-dimensional data, big data, kernel methods, and statistical computing. His work has been funded by the Natural Sciences and Engineering Research Council of Canada. He served as guest editor for Neurocomputing (2015–2016).
