Image and Vision Computing

Volume 60, April 2017, Pages 30-37

Random Multi-Graphs: A semi-supervised learning framework for classification of high dimensional data*

https://doi.org/10.1016/j.imavis.2016.08.006

Highlights

  • A novel graph-based semi-supervised learning framework is proposed.

  • RMG can handle high dimensional problems by injecting randomness into the graph.

  • Randomness as a regularization can avoid the curse of dimensionality and overfitting.

  • Experimental results on eight data sets demonstrate the method's effectiveness.

Abstract

High dimensional data processing currently faces two main difficulties: ineffective similarity measures and high computational complexity in both time and memory. Common methods for dealing with these difficulties are based on dimensionality reduction and feature selection. In this paper, we present a different way to solve high dimensional data problems by combining the ideas of Random Forests and Anchor Graph semi-supervised learning. We randomly select a subset of features and use the Anchor Graph method to construct a graph. This process is repeated many times to obtain multiple graphs, and the repetitions can run in parallel to ensure runtime efficiency. The multiple graphs then vote to determine the labels for the unlabeled data. We argue that the randomness can be viewed as a kind of regularization. We evaluate the proposed method on eight real-world data sets, comparing it with two traditional graph-based methods and one state-of-the-art semi-supervised learning method based on Anchor Graph, to show its effectiveness. We also apply the proposed method to face recognition.

Introduction

High dimensional data classification problems have become ubiquitous due to significant advances in computing technology, e.g. bag-of-words document classification with a huge dictionary, gene expression classification, and multimedia classification. High dimensionality poses significant mathematical challenges to traditional classification methods because of its computational time and space demands. We identify two difficulties in high dimensional data processing. The first is the ineffectiveness of common similarity measures. Euclidean distance, which is typically used to measure similarity between data points, is meaningful in roughly 2 to 10 dimensional spaces. In high dimensional space, however, the sparsity of the data causes Euclidean distances between points to lose their contrast, so their effectiveness as a similarity measure declines as the dimensionality increases. The second difficulty is the so-called curse of dimensionality: increasing the dimensionality brings an explosive growth in computation time and memory usage.
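To make the first difficulty concrete, the following minimal sketch (ours, not the paper's) demonstrates the well-known distance concentration effect: the relative contrast between a query point's farthest and nearest neighbors shrinks toward zero as the dimensionality grows.

```python
# Distance concentration demo: as d grows, Euclidean distances from a query
# to uniformly random points become nearly indistinguishable.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000, 10000]:
    X = rng.random((500, d))              # 500 uniform points in [0, 1]^d
    q = rng.random(d)                     # a random query point
    dist = np.linalg.norm(X - q, axis=1)  # Euclidean distances to the query
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:>6}: relative contrast = {contrast:.3f}")
# The printed contrast is large for d = 2 but collapses as d increases, which
# is why Euclidean similarity loses its discriminating power in high dimensions.
```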

Two common approaches to high dimensional data are dimensionality reduction and feature selection. Dimensionality reduction techniques try to obtain a low dimensional embedding of the high dimensional data and fall into two categories: linear and nonlinear methods. Commonly used linear methods include Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Linear Discriminant Analysis (LDA); commonly used nonlinear methods include Laplacian Eigenmaps (LE), Locally Linear Embedding (LLE), Multidimensional Scaling (MDS), Isometric Mapping (ISOMAP), and Kernel PCA (KPCA). See [1] for a tutorial. Feature selection techniques try to select the most effective features and eliminate the irrelevant ones [2]; current research focuses on search strategies and evaluation criteria. See [3] for a tutorial. Generally speaking, the core idea behind both approaches is to represent the original data with fewer, more significant and discriminative features.
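As a small illustration of the difference between the two approaches, the following sketch (our synthetic example, assuming scikit-learn; none of it comes from the paper) reduces a 500-dimensional data set to 20 dimensions both ways: PCA builds new axes from linear combinations of all features, while univariate selection keeps 20 of the original features.

```python
# Dimensionality reduction vs. feature selection on synthetic data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=500,
                           n_informative=20, random_state=0)

X_dr = PCA(n_components=20).fit_transform(X)             # new combined axes
X_fs = SelectKBest(f_classif, k=20).fit_transform(X, y)  # original features kept

print(X_dr.shape, X_fs.shape)  # both (300, 20): fewer, more informative columns
```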

This paper, however, proposes a different way to handle high dimensional data. We consider the semi-supervised setting for two reasons. First, data acquisition keeps getting easier as technology improves, so data sets are growing ever larger and their dimensionality ever higher; this is what is commonly called Big Data. Second, it is very common in application domains that labeled data is scarce and expensive while unlabeled data is plentiful and cheap. Supervised learning does not suit this scenario, whereas semi-supervised learning, which uses large amounts of unlabeled data to help improve classification performance, was born for this purpose.

To address the difficulties of high dimensional data processing mentioned above, we draw on two observations. The first is the Random Forests method [4], which can handle high dimensional data without dimensionality reduction or feature selection; thanks to its injected randomness, it generalizes well and does not easily overfit. We adopt a similar idea and inject randomness into graphs by randomly selecting a subset of features to build each graph. The size of the feature subset is usually far smaller than the original dimension, so in the selected feature space a Euclidean similarity measure becomes effective again. The second is the subset-based large scale graph construction method Anchor Graph [5], which uses a small subset of data points, called anchors, to construct a graph over the whole data set; it scales linearly with the size of the data set and can therefore handle very large data sets. We use the Anchor Graph method to construct a graph in each randomly selected feature space and perform semi-supervised inference on it. Combining the idea of Random Forests with semi-supervised learning based on Anchor Graph, we propose a new semi-supervised framework named Random Multi-Graphs to deal with high dimensional, large scale data problems. We randomly select a subset of features and use Anchor Graph to construct a graph; this process is repeated to obtain multiple graphs, which can be built in parallel to ensure runtime efficiency, and the multiple graphs then vote to determine the labels of the unlabeled data.
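The following condensed sketch shows the overall flow of this framework. It is our reconstruction, not the authors' code: the function and parameter names (one_graph_predict, n_feats, sigma, gamma) are hypothetical, k-means is used to pick anchors, and the dense Gaussian data-to-anchor weights and ridge-regression label inference are simplified stand-ins for the sparse nearest-anchor weights and Anchor Graph inference of [5].

```python
# Random Multi-Graphs, simplified: several random-subspace anchor graphs vote.
import numpy as np
from sklearn.cluster import KMeans

def one_graph_predict(X, y_lab, lab_idx, n_anchors=50, n_feats=30,
                      sigma=1.0, gamma=0.1, rng=None):
    """Build one anchor graph on a random feature subset; label all points."""
    rng = rng if rng is not None else np.random.default_rng()
    feats = rng.choice(X.shape[1], size=n_feats, replace=False)  # random subspace
    Xf = X[:, feats]
    anchors = KMeans(n_clusters=n_anchors, n_init=4,
                     random_state=0).fit(Xf).cluster_centers_
    # Z: Gaussian affinity of every point to every anchor, row-normalized
    # (a dense simplification of the sparse weights used by Anchor Graph).
    d2 = ((Xf[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=-1)
    Z = np.exp(-d2 / (2.0 * sigma ** 2))
    Z /= Z.sum(axis=1, keepdims=True)
    # Ridge-regress soft anchor labels from the labeled rows of Z, then
    # propagate them back to every point through Z.
    classes = np.unique(y_lab)
    Y = (y_lab[:, None] == classes[None, :]).astype(float)  # one-hot labels
    Zl = Z[lab_idx]
    A = np.linalg.solve(Zl.T @ Zl + gamma * np.eye(n_anchors), Zl.T @ Y)
    return classes[np.argmax(Z @ A, axis=1)]

def random_multi_graphs(X, y_lab, lab_idx, n_graphs=10, **kw):
    """Majority vote over independently built graphs (loop is parallelizable)."""
    rng = np.random.default_rng(0)
    votes = np.stack([one_graph_predict(X, y_lab, lab_idx, rng=rng, **kw)
                      for _ in range(n_graphs)])
    # Per-point majority vote; assumes integer class labels 0..C-1.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Because each graph depends only on its own feature subset, the n_graphs iterations are embarrassingly parallel, which is what makes the framework practical at scale.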

We evaluate the proposed method on eight real-world data sets, comparing it with two traditional graph-based methods and one state-of-the-art semi-supervised learning method based on Anchor Graph, to show its effectiveness. As an application of the proposed method, we analyze image data for the purpose of face recognition.

The main contributions of this paper are as follows:

  • We present a new graph-based semi-supervised learning framework that handles high dimensional, large scale data problems by injecting randomness into the graph.

  • We show that the randomness can be viewed as a kind of regularization technique to avoid the curse of dimensionality and overfitting.

  • Experiments show the performance gains on high dimensional data problems.

The rest of this paper is organized as follows. In Section 2 we give a brief introduction to graph-based semi-supervised learning, Random Forests and Anchor Graph. The proposed framework is described in Section 3. In Section 4, the experiments are presented, and Section 5 concludes this paper.

Section snippets

Background

In this section, we briefly review three related topics: 1) graph-based semi-supervised learning framework; 2) Random Forests; and 3) Anchor Graph.

Random Multi-Graphs

Fig. 1 shows the proposed method. The details are presented below.

Experiments

We evaluate the proposed method on eight real-world data sets to show its effectiveness.

Conclusion and future work

We focused on graph-based semi-supervised learning for high dimensional data. Combining the ideas of Random Forests and Anchor Graph, we propose a new framework to deal with high dimensional data that effectively avoids the curse of dimensionality and efficiently obtains better classification accuracy. We randomly choose a subset of the features to create a graph based on anchors, repeat this process to obtain multiple graphs, and then vote to determine the labels of the unlabeled data.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) (Nos. 61271405, 61403353), the International Science & Technology Cooperation Program of China (ISTCP) (No. 2014DFA10410), the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) and the Fundamental Research Funds for the Central Universities of China. The Titan X GPU used for this research was donated by the NVIDIA Corporation. We thank Leon Bullock very much for his great assistance in

References (41)

  • G. Zhong et al., Error-correcting output codes based ensemble feature extraction, Pattern Recogn. (2013)

  • G. Chandrashekar et al., A survey on feature selection methods, Comput. Electr. Eng. (2014)

  • I.K. Fodor, A survey of dimension reduction techniques, Neoplasia (2008)

  • L. Breiman, Random forests, Mach. Learn. (2001)

  • W. Liu et al., Large graph construction for scalable semi-supervised learning

  • M. Abbasi et al., Monocular 3D human pose estimation with a semi-supervised graph-based method

  • A.B. Goldberg et al., Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization

  • X. Zeng et al., Chinese named entity recognition with graph-based semi-supervised learning model

  • R. Liu et al., Graph-based semi-supervised learning algorithm for web page classification

  • B.B. Liu et al., Image colourisation using graph-based semi-supervised learning, IET Image Process. (2009)

  • L. Ma et al., Graph-based semi-supervised learning for spectral-spatial hyperspectral image classification, Pattern Recogn. Lett. (2016)

  • M. Stikic et al., Multi-graph based semi-supervised learning for activity recognition

  • J. Tang et al., Image annotation by graph-based inference with integrated multiple/single instance representations, IEEE Trans. Multimedia (2010)

  • Y. Zhao et al., Graph-based semi-supervised learning for fault detection and classification in solar photovoltaic arrays, IEEE Trans. Power Electron. (2013)

  • A. Blum et al., Learning from labeled and unlabeled data using graph mincuts

  • X. Zhu et al., Semi-supervised learning using Gaussian fields and harmonic functions

  • D. Zhou et al., Learning with local and global consistency, Adv. Neural Inf. Proces. Syst. (2004)

  • M. Belkin et al., Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. (2006)

  • M. Seeger, Learning with labeled and unlabeled data (2000)

  • O. Chapelle et al., Semi-supervised Learning (2006)
* This paper has been recommended for acceptance by Jiwen Lu.
