
Pattern Recognition Letters
Volume 151, November 2021, Pages 127-134

Effective semi-supervised learning for structured data using Embedding GANs

https://doi.org/10.1016/j.patrec.2021.07.019

Highlights

  • A new branch is added to process structured data.

  • The objective function is modified to adapt to the new branch.

  • Detailed comparisons are made between our algorithm and other classical models.

Abstract

Semi-supervised learning (SSL) was proposed to deal with situations in which only a few samples are labeled, so making the most of the existing samples is crucial. When facing unstructured data, we generally apply data augmentation methods, such as generative adversarial networks (GANs), to increase the amount of data. However, things are quite different with structured data: the continuity of neural networks limits their use on categorical variables. In this paper, we propose a semi-supervised Embedding GAN (EmGAN) to solve this problem. We add an embedding layer before the discriminator to better characterize categorical features and design a new loss function to further train our model. Moreover, the structures of the generator and the discriminator are modified to match structured data. In experiments evaluated with nonparametric statistical tests, EmGAN shows its advantages in processing structured samples.

Introduction

It is harder to gather labeled samples than unlabeled samples in many fields, ranging from anomaly detection to mail filtering systems. In this situation, semi-supervised learning (SSL) has drawn a lot of attention. The goal of this paradigm is to design models in the presence of both labeled and unlabeled data.

Nowadays, SSL algorithms are developing rapidly in different research areas. In this section, we first provide a basic review of this direction and introduce related works that have attracted considerable attention in the last few years.

Generally, four approaches have been widely recognized for tackling SSL:

  • Self-labeling: assuming the classifier is biased towards correctly predicted samples, the predictions for the unlabeled samples with the highest confidence are taken as their new labels. Well-known self-labeling approaches include Democratic-Co [1], Tri-Training [2], Co-Bagging [3], and Co-Training [4];

  • Generative models and the cluster-then-label algorithm [5], [6]: this was the first attempt to use generative models in the field of SSL. The purpose of the model is to learn the joint probability p(x, y) = p(y)p(x|y), where p(x|y) can be a Gaussian mixture model or any identifiable mixture distribution, with coefficients determined from both unlabeled and labeled data. The cluster-then-label algorithm is very similar to generative models: it first clusters the whole data set and then labels each cluster based on the labeled data (a minimal sketch is given after this list);

  • Graph-based algorithms: [7] formulates SSL as a graph min-cut problem. Labeled and unlabeled samples are represented as graph nodes, and the similarity between samples corresponds to the weight of the graph edges. These algorithms generally assume label smoothness over the whole graph. Recent advances can be found in [8];

  • Semi-supervised SVM (S3VM): S3VM extends support vector machines to training with unlabeled data [9]. This method implements the cluster assumption of SSL: the classes are well separated, and the decision boundary does not cut through dense regions of unlabeled data, because samples within the same cluster tend to share a label. Highly relevant references include [10].
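As an illustration of the cluster-then-label idea in the second item above, the following Python sketch clusters all samples with k-means and assigns each cluster the majority label of the labeled points it contains. The function name, the choice of k-means, and the fallback rule are illustrative assumptions on our part, not details taken from [5], [6].

```python
# Illustrative cluster-then-label baseline: cluster all samples, then give
# each cluster the majority label of the labeled points it contains.
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X_labeled, y_labeled, X_unlabeled, n_clusters):
    X_all = np.vstack([X_labeled, X_unlabeled])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)
    labeled_clusters = clusters[: len(X_labeled)]
    cluster_to_label = {}
    for c in np.unique(clusters):
        mask = labeled_clusters == c
        if mask.any():
            # Majority vote among the labeled samples in this cluster.
            cluster_to_label[c] = int(np.bincount(y_labeled[mask]).argmax())
    # Clusters without any labeled point fall back to the global majority.
    fallback = int(np.bincount(y_labeled).argmax())
    return np.array([cluster_to_label.get(c, fallback)
                     for c in clusters[len(X_labeled):]])
```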

However, these methods' use of unlabeled data relies on certain assumptions about the data distribution. Generally, preprocessing methods are applied to shape the data distribution so that it is easier for the model to learn. By contrast, deep neural networks perform such tasks remarkably well without any laborious and time-consuming feature engineering.

In 2014, Goodfellow et al. proposed a new framework for training generative models via an adversarial process [11]. In this architecture, the generator produces images to confuse the discriminator, while the discriminator constantly compares real images with generated ones to improve its ability to distinguish true from fake samples. By repeating this process, the generator learns to produce fake images that the discriminator cannot tell apart from real ones. Later, [12] applied GANs to the SSL problem of image classification for the first time and obtained remarkable results. The goal of the generator is still to match the statistics of the generated samples to those of the real samples, while the discriminator is no longer a K-class classifier; instead, it labels the generated data with a new class y = K+1. Experiments [13] have shown that such (K+1)-class discriminators achieve better classification performance. In [14], the authors proposed the Bad GAN structure and proved that a good semi-supervised classification (SSC) GAN with a bad generator can achieve better performance.
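To make the (K+1)-class formulation concrete, the following PyTorch-style sketch gives one common reading of the discriminator objective in [12], [13]: a supervised cross-entropy term on labeled data, plus unsupervised terms pushing real unlabeled samples away from class K and generated samples toward it. The function signature and the epsilon smoothing are our own illustrative assumptions, not the exact losses of those papers.

```python
# Sketch of a (K+1)-class discriminator loss for semi-supervised GANs:
# class indices 0..K-1 are the real classes, index K means "generated".
import torch
import torch.nn.functional as F

def discriminator_loss(logits_labeled, y, logits_unlabeled, logits_fake, K):
    # Supervised term: labeled data must be assigned its true class.
    loss_sup = F.cross_entropy(logits_labeled, y)
    # Real unlabeled data should not be classified as the fake class K ...
    p_fake_real = F.softmax(logits_unlabeled, dim=1)[:, K]
    loss_real = -torch.log(1.0 - p_fake_real + 1e-8).mean()
    # ... while generated data should be classified as class K.
    p_fake_gen = F.softmax(logits_fake, dim=1)[:, K]
    loss_gen = -torch.log(p_fake_gen + 1e-8).mean()
    return loss_sup + loss_real + loss_gen
```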

All the above GAN-based SSL methods are suited to unstructured data, such as images and audio. But classification scenarios also involve structured data sets, another part of everyday life that cannot be ignored. For example, data from an online retail store might have rows representing sales made by customers and columns such as item bought, quantity, price, and time stamp.

There are various options for encoding categorical variables, such as label/numerical encoding and one-hot encoding, to handle discrete data. But most of these techniques are problematic in terms of memory and faithful representation of categorical levels: a feature with thousands of distinct levels expands into an equally high-dimensional sparse vector under one-hot encoding, while label encoding imposes a spurious ordering on the levels. The continuity of neural networks limits their applicability to categorical features, so simply applying networks to structured data with numerically encoded categorical features does not work well.

Drawing on the idea of entity embedding [15] and related algorithms [16], [17], [18], [19], we propose the semi-supervised Embedding GAN (EmGAN).

Our method first maps the categorical variables into Euclidean spaces; the mapping result is the entity embedding of the categorical features, learned by the standard back-propagation algorithm. This representation captures the intrinsic properties of the categorical features by mapping similar samples close to each other in Euclidean space. The entity embeddings of the categorical variables are then concatenated with the other numerical features as the input of the discriminator. EmGAN does not use convolution or pooling layers in its network structure, to avoid losing feature information during extraction; a detailed discussion can be found in the last paragraph of Section 3.1. In addition, a new loss function is proposed for training on structured data. The KEEL data sets [20] are used to compare our proposed method with others. The experiments show that, compared with traditional SSL algorithms, our algorithm achieves better performance on structured data.
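A minimal sketch of this entity-embedding input stage (our own illustrative PyTorch code, not the authors' released implementation): each categorical column gets its own embedding table, and the embedded vectors are concatenated with the numerical features before entering the discriminator body.

```python
# Entity-embedding input layer: one nn.Embedding per categorical column,
# concatenated with the continuous features. All sizes here are illustrative.
import torch
import torch.nn as nn

class EmbeddingInput(nn.Module):
    def __init__(self, cardinalities, emb_dims, n_numeric):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, dim)
            for card, dim in zip(cardinalities, emb_dims))
        self.out_dim = sum(emb_dims) + n_numeric

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_categorical) integer category codes
        # x_num: (batch, n_numeric) continuous features
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        return torch.cat(embedded + [x_num], dim=1)

# Usage: two categorical columns with 12 and 7 levels, five numeric features.
layer = EmbeddingInput(cardinalities=[12, 7], emb_dims=[4, 3], n_numeric=5)
out = layer(torch.randint(0, 7, (32, 2)), torch.randn(32, 5))  # (32, 12)
```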

Section snippets

Related work

Our EmGAN modifies the structure of the semi-supervised GAN and combines it with the theory of entity embedding, so we briefly introduce related work on semi-supervised GANs and entity embedding.

Embedding GAN

This section introduces the new architectural features and objective function that we apply to the generative adversarial network (GAN) framework, from three aspects: the model structure, the objective function of the generator, and the objective function of the discriminator.

Experiments

In this section, several metrics are provided to analyze the differences in performance between our network and other semi-supervised algorithms. We first introduce the experimental benchmark data sets and the hyperparameters of the algorithm. Then all the compared algorithms and the parameter settings of the experiments are described. Finally, the statistical tests used to contrast the results obtained are briefly introduced. The running time is tested on our local machine, …

Results

This section is divided into four distinct parts: an analysis of the results obtained in transductive and inductive learning under different ratios of labeled data (see 5.1–5.2); a statistical test of 8 outstanding methods with 4 labeled ratios (5.3); a comparison of different objective functions (5.5); and an evaluation of several supervised learning methods (5.6).
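As an example of the kind of nonparametric statistical test used for such comparisons, a Friedman test over per-data-set scores can be run with SciPy. The accuracy values below are placeholders for illustration, not results from the paper.

```python
# Friedman test: compare several algorithms across the same data sets.
from scipy.stats import friedmanchisquare

acc_emgan   = [0.91, 0.85, 0.78, 0.88]  # one score per data set (hypothetical)
acc_s3vm    = [0.87, 0.80, 0.75, 0.84]
acc_cotrain = [0.86, 0.82, 0.74, 0.83]

stat, p = friedmanchisquare(acc_emgan, acc_s3vm, acc_cotrain)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```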

Conclusions

In this research, we propose Embedding GAN for data mining applications. EmGAN trains a generative model G(x; θ_g) that learns the distribution of the concatenated data E(x_CT) + x_NL. G(x; θ_g) is used to generate complementary data to enhance the performance of the discriminator, and D(x; θ_d) is used to classify the data set {x_l, x_u, x_g}. We evaluate the model on the KEEL data sets with different labeled-instance ratios and metrics, and compare it with multiple SSC methods. The experiments indicate that our method would be …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors gratefully acknowledge support from the National Natural Science Foundation of China under grants No. 61772553 and 61379058.

References (34)

  • S. Wan et al., Faster R-CNN for multi-class fruit detection using a robotic vision system, Comput. Netw. (2020)

  • Y. Zhou et al., Democratic co-learning, Proc. Int. Conf. Tools with Artif. Intell. (ICTAI) (2004)

  • Z. Zhou et al., Tri-training: exploiting unlabeled data using three classifiers, IEEE Trans. Knowl. Data Eng. (2005)

  • J.R. Pasillas-Díaz et al., Bagged subspaces for unsupervised outlier detection, Comput. Intell. (2017)

  • A. Blum et al., Combining labeled and unlabeled data with co-training, Proc. 11th Annual Conference on Computational Learning Theory (1998)

  • K. Nigam et al., Text classification from labeled and unlabeled documents using EM, Mach. Learn. (2000)

  • X. Tang et al., Semi-supervised Bayesian ARTMAP, Appl. Intell. (2010)

  • A. Blum et al., Learning from labeled and unlabeled data using graph mincuts, Proc. ICML (2001)

  • J. Wang et al., Semi-supervised learning using greedy max-cut, J. Mach. Learn. Res. (2013)

  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. (1998)

  • O. Chapelle et al., Optimization techniques for semi-supervised support vector machines, J. Mach. Learn. Res. (2008)

  • I.J. Goodfellow et al., Generative adversarial nets (2014)

  • A. Odena, Semi-supervised learning with generative adversarial networks, arXiv:1606.01583 (2016)

  • T. Salimans et al., Improved techniques for training GANs, Proc. NIPS Conf. 30 (2016)

  • Z. Dai et al., Good semi-supervised learning that requires a bad GAN, Proc. NIPS Conf. 31 (2017)

  • C. Guo et al., Entity embeddings of categorical variables, arXiv:1604.06737 (2016)

  • X. Deng et al., An influence model based on heterogeneous online social network for influence maximization, IEEE Trans. Netw. Sci. Eng. (2019)