Decision Support Systems

Volume 105, January 2018, Pages 66-76

Rebuilding sample distributions for small dataset learning

https://doi.org/10.1016/j.dss.2017.10.013

Highlights

  • Most algorithms output unsatisfactory predictions when working with small data.

  • A data-driven method is proposed for enhancing the data structures of small data.

  • A set of new functions is derived to estimate the domains of small data.

  • The method successfully improves the predictions of algorithms in two real cases.

Abstract

Over the past few decades, numerous learning algorithms have been proposed to extract knowledge from data. The majority of these algorithms have been developed under the assumption that training sets can represent their populations. When a training set captures only a few properties of its population, these algorithms may extract minimal and/or biased knowledge for decision makers. This study develops a systematic procedure based on fuzzy theories to create new training sets by rebuilding the possible sample distributions; the procedure contains new functions that estimate domains and a sample-generating method. Two real cases of a leading company in the thin film transistor liquid crystal display (TFT-LCD) industry are examined. Two learning algorithms—a back-propagation neural network and support vector regression—are employed for modeling, and two sample generation approaches—bootstrap aggregating (bagging) and the synthetic minority over-sampling technique (SMOTE)—are employed to compare the accuracy of the models. The results indicate that the proposed method outperforms bagging and the SMOTE with the strongest statistical support.

Introduction

Over the past few decades, numerous machine learning algorithms have been developed to extract knowledge from data [1]. However, the majority of these algorithms were developed based on the assumption that training sets can represent the properties of their populations. If the training data instead contain insufficient information about the population, the learning algorithms may produce less precise results for future events.

Although issues related to big-data learning have only attracted attention in recent years, issues related to small-data learning date back to Student's t-distribution [2] in 1908. Collecting additional samples to enlarge a sample size and ensure that algorithms can perform sufficient learning is sometimes difficult and/or expensive, such as in the diagnosis of rare diseases [3], [4], the examination of deoxyribonucleic acid (DNA) microarrays [5], pattern recognition with limited pixels [6], [7], the development of new products [8], and systems in their initial stages [9]. How to effectively learn robust and accurate information from small data is therefore an issue worthy of additional research.

To demonstrate how small data affect the learning results of most algorithms, Fig. 1 displays two possible distributions of two small datasets with regard to their populations. In Fig. 1(a), the instances are evenly distributed across the population. Although most learning approaches can extract correct knowledge about such a population, only a small amount of information will be obtained. Conversely, in Fig. 1(b), the instances are concentrated in one part of the population. The majority of learning approaches will produce biased outcomes in this case, regardless of the data size.

In addition to the sample distribution, another issue that can cause insufficient information to be obtained is the gaps between observations in small data. As shown in Fig. 2, even when the observations are evenly distributed in the population, gaps exist between any two observations in a small dataset. These gaps (referred to as information gaps) would be filled with observations in a complete dataset; however, those observations are not available. Most learning algorithms cannot train their patterns on the unavailable instances in the information gaps of small datasets, and therefore the obtained information is inadequate. For example, most tree-based algorithms, such as the C4.5 decision tree [10], need to partition continuous data into discrete intervals before evaluating the classification purity. However, the expected size of an interval is usually unattainable in small datasets, since intervals that contain no observations are merged with their nearest intervals. If too few candidate positions exist for the purity evaluation, the trees that are built, and the resulting hierarchy of classification rules, will be small.
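To illustrate the candidate-split problem described above, the following sketch (a simplification, not the C4.5 procedure itself) enumerates the usual candidate split positions on a continuous attribute, namely the midpoints between adjacent distinct values; with only a handful of observations, very few candidates remain for the purity evaluation. The function name candidate_splits is illustrative.

```python
import numpy as np

def candidate_splits(values):
    """Midpoints between adjacent distinct sorted values, the usual split candidates for a tree."""
    v = np.unique(values)
    return (v[:-1] + v[1:]) / 2.0

full = np.linspace(0, 10, 101)          # a densely observed attribute
small = np.array([0.5, 3.2, 7.8, 9.1])  # a small dataset with large information gaps
print(len(candidate_splits(full)))      # 100 candidate positions
print(candidate_splits(small))          # only 3 positions at which purity can be evaluated
```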

Virtual sample generation (VSG) methods can be employed to address the learning problem of small data. These methods are a type of data-preprocessing method applied in the process of knowledge discovery in databases (KDD) [11], and research has demonstrated their effectiveness [12]. One of the most extensively applied VSG methods is the bootstrapping procedure (BP) [13], which creates new training sets (referred to as bootstrapping sets) by resampling instances from the original data with a certain probability. The benefit of this approach is that most learning algorithms train on a sample at least twice, gradually revising the identified patterns so that they represent the behaviors of the actual data. To overcome the overfitting issue in training sets, numerous ensemble learning methods have been developed, such as bagging [14] and random forests [15], which employ BP to create bootstrapping sets from which algorithms build classifiers and determine classes by voting. Currently, bagging and random forests are extensively applied to extract knowledge from big data since each bootstrapping set can denote one evenly distributed part of a population.
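For reference, the following is a minimal sketch of the bootstrapping step, assuming the training data are held in NumPy arrays; the function name make_bootstrap_sets is illustrative rather than part of any particular library.

```python
import numpy as np

def make_bootstrap_sets(X, y, n_sets, rng=None):
    """Draw bootstrapping sets by resampling rows with replacement (illustrative)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    sets = []
    for _ in range(n_sets):
        idx = rng.integers(0, n, size=n)   # each observation may appear 0, 1, or more times
        sets.append((X[idx], y[idx]))
    return sets

# Example: 10 observations, 3 bootstrapping sets for a bagging-style ensemble
X = np.arange(10).reshape(-1, 1).astype(float)
y = X.ravel() * 2.0
for Xb, yb in make_bootstrap_sets(X, y, n_sets=3, rng=0):
    print(sorted(Xb.ravel()))              # repeated and missing observations are visible
```

Printing the resampled values makes the two issues discussed next visible: some observations are duplicated while others are absent from a given bootstrapping set.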

When bagging or random forests are applied to learn from small data, the use of bootstrapping sets may create two issues: an unstable data structure and overfitting, as shown in Fig. 3(a) and Fig. 3(b), respectively. A comparison of Fig. 2 with Fig. 3(a) reveals that certain observations in Fig. 2 are missing in Fig. 3(a) because they were not selected when the bootstrapping sets were formed. Because the number of observations is very small, the difference between the features of the two bootstrapping sets in Fig. 3(a) is large. Since the amount of information provided by small data is minimal, any observation missing from a bootstrapping set increases the loss of information. Although the number of observations drawn can be doubled when forming the bootstrapping sets, as shown in Fig. 3(b), this step usually causes the patterns identified by the algorithms to represent the behaviors of only a few observations, which leads to overfitting. The amount of information provided by the set in Fig. 3(b) does not increase because the additional instances merely repeat the information carried by the same observations.

The synthetic minority over-sampling technique (SMOTE) [16] was proposed to generate artificial samples that differ from the original samples in the minority class. Based on the k-nearest neighbors, the SMOTE generates synthetic data along the continuous vectors between the minority-class instances and their nearest neighbors, as shown in Fig. 4. Although the information gaps in the minority class are filled with synthetic instances, these instances are distributed only within the domain of the real instances in the minority class. The sample-generation scheme employed by the SMOTE is simple and does not consider the possible distribution of the entire minority class.
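A minimal sketch of this interpolation idea is given below; it assumes the minority-class instances are rows of a NumPy array, and the helper smote_like is an illustrative simplification that omits the class-imbalance bookkeeping of the full SMOTE algorithm.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Generate synthetic points on segments between minority samples and their k nearest neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                           # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
print(smote_like(X_min, n_new=5, rng=0))             # every point stays between existing minority samples
```

Because each synthetic point lies on a segment between two real instances, the generated samples never extend beyond the observed domain of the minority class, which is the limitation noted above.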

Since fuzzy theory was proposed by Zadeh [17] in 1965, it has been employed to handle uncertain events. For example, to expand crisp observations to fill the information gaps caused by a lack of data, Huang [18] proposed the principle of information diffusion, in which a normal diffusion function, developed based on fuzzy theories, is defined as

$$\tilde{f}_n(x)=\frac{1}{nh\sqrt{2\pi}}\sum_{i=1}^{n}\exp\!\left[-\frac{(x-x_i)^2}{2h^2}\right],\qquad(1)$$

where $h$ is the diffusion coefficient and $n$ is the sample size of the attribute. In 2004, Huang and Moraga [19] proposed the diffusion neural network (DNN), which derives a pair of artificial samples for each observation based on Eq. (1) to fill the information gaps. Since the parameter $n$ in Eq. (1) is a constant once a dataset is given, Huang and Moraga [19] applied the term $\exp[-(x-x_i)^2/(2h^2)]$ to derive the virtual values of an observation $(x, y)$ in a two-dimensional dataset as $x' = x \pm \sqrt{-2h_x^2 \ln \psi(r)}$ and $y' = y \pm \sqrt{-2h_y^2 \ln \psi(r)}$, where $h_x$ and $h_y$ are the diffusion coefficients, induced from a large number of simulation results in the DNN, $r$ is the correlation coefficient between the input $X$ and the output $Y$, and $\psi(r)$ is the transforming function

$$\psi(r)=\psi(0.9+m\times 10^{-2})=\underbrace{0.99\cdots 9}_{m\ \text{nines}},\qquad r=0.91,0.92,\ldots,0.99,$$

which maps $r$ into a possibility value. For example, if $r$ is 0.93 (or 0.96), then $m$ is 3 (or 6) and the possibility $\psi(r)$ is 0.999 (or 0.999999).
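To make the DNN computation concrete, the sketch below is an illustrative reading of the formulas above, not the authors' implementation; the diffusion coefficients h_x and h_y are placeholder values rather than the simulation-induced coefficients used in the DNN, and the function name dnn_virtual_pair is hypothetical.

```python
import math

def dnn_virtual_pair(x, y, h_x, h_y, r):
    """Derive (x', y') pairs by inverting exp[-(x'-x)^2/(2h^2)] = psi(r) (illustrative)."""
    m = round((r - 0.9) * 100)              # r = 0.9 + m * 10^-2, valid for 0.91 <= r <= 0.99
    psi = 1 - 10 ** (-m)                    # "m nines", e.g. m = 3 -> 0.999
    dx = math.sqrt(-2 * h_x ** 2 * math.log(psi))
    dy = math.sqrt(-2 * h_y ** 2 * math.log(psi))
    return (x - dx, y - dy), (x + dx, y + dy)

# One observation (x, y) = (5.0, 12.0) with placeholder coefficients and r = 0.93
print(dnn_virtual_pair(5.0, 12.0, h_x=0.4, h_y=0.8, r=0.93))
```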

Since the DNN requires r to be > 0.9, its applicability is limited in most practical cases. In addition, the distributions constructed by the DNN still contain information gaps because the DNN only considers the behavior of an individual observation rather than the behavior of the entire dataset.

To improve the robustness and/or accuracy of the forecasting models produced after data preprocessing when learning with small data, this study proposes a systematic procedure to create new training sets by rebuilding the possible sample distributions. The procedure contains a set of new functions that estimate the possible ranges of observations and a sample generation method that considers the relations among attributes; the functions and the method are developed based on the fuzzy normal function [18] and fuzzy concepts, respectively.

In the experiments, two learning algorithms—a back-propagation neural network (BPN) and support vector regression (SVR)—are adopted to build models, and two VSG approaches—bagging (using BP) and the SMOTE—are employed to compare the effectiveness of the models. Two real cases of a leading company in the thin film transistor liquid crystal display (TFT-LCD) industry in Taiwan are examined. The results indicate that the proposed method is more effective than bagging and the SMOTE for learning from the two cases, with the strongest statistical support.

The remainder of this study is organized as follows. Section 2 introduces the proposed procedure, and Section 3 describes the experimental environment and the background of the two real cases. Section 4 discusses the experimental results, and Section 5 presents the conclusions of this paper.


Proposed methodology

Two steps can be employed to enhance the data structures of small data by rebuilding the possible sample distributions: estimating the sample distributions and creating new samples. Before introducing the proposed method, the notations in this work are defined.
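Because this snippet only names the two steps, the outline below is an illustrative sketch of the general idea rather than the proposed functions themselves: each attribute's observed range is first extended by a rough, diffusion-style margin to approximate its possible domain, and virtual samples are then drawn inside the extended domains to form a new training set. The helper names and the nearest-neighbour output assignment are assumptions made purely for illustration.

```python
import numpy as np

def estimate_domain(values, spread=1.0):
    """Illustrative domain estimate: extend the observed range by a diffusion-style margin."""
    lo, hi = values.min(), values.max()
    margin = spread * values.std(ddof=1)      # placeholder for a fuzzy/diffusion-based estimate
    return lo - margin, hi + margin

def rebuild_training_set(X, y, n_virtual, rng=None):
    """Illustrative two-step procedure: estimate attribute domains, then sample within them."""
    rng = np.random.default_rng(rng)
    bounds = [estimate_domain(X[:, j]) for j in range(X.shape[1])]
    X_virtual = np.column_stack([rng.uniform(lo, hi, size=n_virtual) for lo, hi in bounds])
    # The actual procedure generates outputs with the relations among attributes in mind;
    # a nearest-neighbour lookup stands in here as a simple placeholder.
    y_virtual = np.array([y[np.argmin(np.linalg.norm(X - xv, axis=1))] for xv in X_virtual])
    return np.vstack([X, X_virtual]), np.concatenate([y, y_virtual])

# Example: enlarge a 5-observation, 2-attribute training set with 20 virtual samples
X = np.array([[1.0, 2.0], [2.0, 3.5], [3.0, 5.1], [4.0, 6.8], [5.0, 9.0]])
y = np.array([2.1, 4.2, 6.0, 8.1, 10.2])
X_new, y_new = rebuild_training_set(X, y, n_virtual=20, rng=0)
print(X_new.shape, y_new.shape)   # (25, 2) (25,)
```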

Experimental environment

This section presents the designs of experiments and a description of two real cases of a TFT-LCD maker in Taiwan.

Experimental results and discoveries

In the sensitivity analysis, the training size n in Case I is 5, 10, and 15; in Case II, the training size n is 5, 10, 15, and 20. The experimental results are summarized in Fig. 13 and Fig. 14, where “SDS,” “bagging,” “SMOTE,” and “PM” denote the control group and the experiment 1, 2, and 3 groups, respectively; the symbol “-” in “PM” indicates no significant difference between “PM” and “SDS;” the rectangles with solid lines and dotted lines indicate no significant

Conclusions

While issues surrounding big data learning have attracted a substantial amount of attention in recent years, learning from small data is an older problem. In certain situations, data analyzers have to learn with small amounts of data, such as the two cases described in this paper. Numerous virtual sample generation approaches, which are data preprocessing methods in KDD, have been developed to extract knowledge from these data. Although BP is extensively applied to create new training sets, the


References (24)

  • G. Guo et al., Learning from examples in the small sample case: face expression recognition, IEEE Trans. Syst. Man Cybern. B Cybern. (2005)

  • D.C. Li et al., Employing virtual samples to build early high-dimensional manufacturing models, Int. J. Prod. Res. (2013)

Der-Chiang Li is a Distinguished Professor at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. He received his PhD degree from the Department of Industrial Engineering at Lamar University, Beaumont, Texas, USA, in 1985. His current research interest concentrates on machine learning with small data sets. His articles have appeared in Decision Support Systems, Omega, Information Sciences, European Journal of Operational Research, Computers & Operations Research, International Journal of Production Research, and other publications.

Wu-Kuo Lin is a Ph.D. candidate at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. His current research interests are in the area of forecasting and data mining with small data sets. His articles have appeared in Expert Systems with Applications and The Journal of Grey System.

Chien-Chih Chen is a postdoctoral fellow at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. His articles have appeared in Omega, Expert Systems with Applications, International Journal of Production Research, Neurocomputing, Computers & Industrial Engineering, Journal of Intelligent Manufacturing, and other publications.

Hung-Yu Chen is a Ph.D. candidate at the Institute of Information Management, the National Cheng Kung University, Taiwan. His current research interests focus on the learning issues of small datasets.

Liang-Sian Lin is a Ph.D. researcher at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan, where he also works in the laboratory for small sample learning. His current research interests concentrate on small data sets. His articles have appeared in European Journal of Operational Research, Decision Support Systems, and International Journal of Production Research.
