Rebuilding sample distributions for small dataset learning
Introduction
Over the past few decades, numerous machine learning algorithms have been developed to extract knowledge from data [1]. However, the majority of these algorithms were developed under the assumption that a training set can represent the properties of its population. If the training data instead contain insufficient information about the population, the learning algorithms may produce imprecise results for future events.
Although issues related to big-data learning have only attracted attention in recent years, issues related to small-data learning were revealed by Student's t-distribution [2] in 1908. The collection of additional samples to enlarge a sample size and ensure that algorithms can perform sufficient learning is sometimes difficult and/or expensive in certain situations, such as the diagnoses of rare diseases [3], [4], examination of deoxyribonucleic acid (DNA) microarrays [5], pattern recognition with limited pixels [6], [7], development of new products [8], and systems in their initial stages [9]. How to effectively learn robust and accurate information from small data is therefore an issue worthy of additional research.
To demonstrate how small data affect the learning results of most algorithms, Fig. 1 displays two possible distributions of two small datasets with regard to their populations. In Fig. 1(a), the instances are evenly distributed across the population. Although most learning approaches can extract accurate knowledge from such a sample, only a small amount of information will be obtained. Conversely, in Fig. 1(b), the instances are concentrated in one part of the population. The majority of learning approaches will produce biased outcomes from such a sample regardless of the data size.
In addition to the sample distribution, another issue that can cause insufficient information to be obtained is the gaps between two observations in small data. As shown in Fig. 2, although the observations are evenly distributed in the population, gaps exist between two observations in a small dataset. These gaps (referred to as information gaps) should be filled with observations in a complete dataset; however, these observations are not available. Most learning algorithms fail to train their patterns with the unavailable instances in the information gaps in small datasets, and therefore, the obtained information is inadequate. For example, most tree-based algorithms, such as the C4.5 decision tree [10], need to partition continuous data into discrete intervals before evaluating the classification purity. However, the expected size of an interval is usually unavailable in small datasets since some intervals that contain no observations are integrated with their nearest intervals. If an insufficient number of candidate positions exist for the purity evaluation, then the trees that are built and the resulting hierarchy of the classification rules will be small.
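The effect described above for tree-based learners can be made concrete. The following sketch (a hypothetical helper, not part of any cited algorithm) counts the candidate split positions a tree induction method can evaluate, namely the midpoints between consecutive distinct values of a continuous attribute; a small sample offers very few such positions.

```python
import numpy as np

def candidate_splits(values):
    """Midpoints between consecutive distinct sorted values --
    the positions a tree-based learner can evaluate for purity."""
    v = np.unique(values)
    return (v[:-1] + v[1:]) / 2.0

rng = np.random.default_rng(0)
small = rng.uniform(0, 10, size=5)    # a small dataset
large = rng.uniform(0, 10, size=500)  # a larger sample from the same range

# The small sample yields at most 4 candidate positions, so the
# resulting tree and its hierarchy of classification rules stay small.
print(len(candidate_splits(small)))
print(len(candidate_splits(large)))
```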
Virtual sample generation (VSG) methods can be employed to address the learning problem of small data. These methods are a type of data-preprocessing method applied in the process of knowledge discovery in databases (KDD) [11]; research has demonstrated their effectiveness [12]. One of the most extensively applied VSG methods is the bootstrapping procedure (BP) [13], which creates new training sets (referred to as bootstrapping sets) by resampling instances from the original data with a certain probability. The benefit of this approach is that most learning algorithms train on some samples more than once, gradually revising the identified patterns so that they represent the behaviors of the actual data. To overcome the over-fitting issue in training sets, numerous ensemble learning methods have been developed, such as bagging [14] and random forests [15], which employ BP to create bootstrapping sets for algorithms to build classifiers and determine classes by voting. Currently, bagging and random forests are extensively applied to extract knowledge from big data since each bootstrapping set can denote one evenly distributed part of a population.
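The bootstrapping procedure described above can be sketched in a few lines. This is a minimal illustration (function name and seed are ours), resampling with replacement so that each bootstrapping set keeps the original size while duplicating some observations and omitting others.

```python
import numpy as np

def bootstrap_sets(data, n_sets, rng=None):
    """Create bootstrapping sets by resampling the original data
    with replacement; each set has the same size as the original."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    return [data[rng.integers(0, n, size=n)] for _ in range(n_sets)]

data = np.array([1.2, 3.4, 5.6, 7.8, 9.0])
sets = bootstrap_sets(data, n_sets=3)
for s in sets:
    # Each set contains only values from the original data,
    # typically with repetitions and omissions.
    print(sorted(s))
```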
When applying bagging or random forests to learn with small data, the use of bootstrapping sets may create two issues: an unstable data structure and overfitting, as shown in Fig. 3(a) and Fig. 3(b), respectively. A comparison of Fig. 2 with Fig. 3(a) reveals that certain observations in Fig. 2 are missing in Fig. 3(a) because they were not selected with a certain probability when forming the bootstrapping sets. The number of observations is very small, and thus, the difference between the features of the two bootstrapping sets in Fig. 3(a) is large. Since the amount of information provided by small data is minimal, any missing observations in the bootstrapping sets can increase the loss of information. Although we can double the observations to form the bootstrapping sets, as shown in Fig. 3(b), this step usually causes the patterns identified by the algorithms to represent the behaviors of a few observations, which causes overfitting. The amount of information provided by the set in Fig. 3(b) does not increase because the increased information is the same information provided by the same observations.
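The chance of an observation going missing from a bootstrapping set, as in Fig. 3(a), follows directly from the resampling scheme: a particular observation is never drawn with probability (1 − 1/n)^n, which approaches 1/e (about 36.8%) as n grows. For small n the absolute information loss from each such omission is severe, as this short calculation illustrates.

```python
import math

# Probability that a given observation is absent from a bootstrapping
# set of the same size n, i.e. it is missed in all n draws.
for n in (5, 10, 100):
    p_miss = (1 - 1 / n) ** n
    print(n, round(p_miss, 3))

# Limiting value as n grows: 1/e, approximately 0.368.
print(round(math.exp(-1), 3))
```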
The synthetic minority over-sampling technique (SMOTE) [16] was proposed to generate artificial samples that differ from the original samples in the minority class. Based on the k-nearest neighbors, the SMOTE generates synthetic data along continuous vectors between the minority class's instances and their nearest neighbors, as shown in Fig. 4. Although the information gaps in the minority class are filled with synthetic instances, they are distributed within the domain of the real instances in the minority class. The method employed by the SMOTE to generate samples is simple and does not consider the possible distributions of the entire minority class.
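The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration of the idea, not the reference implementation: each synthetic point is placed at a random position on the segment between a minority instance and one of its k nearest minority neighbors, so all synthetic points stay within the domain of the real minority instances.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, rng=None):
    """Generate synthetic minority samples along the segments between
    a minority instance and one of its k nearest minority neighbors
    (a simplified sketch of the SMOTE idea)."""
    rng = rng or np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)  # distances to the others
        d[i] = np.inf                          # exclude the point itself
        neigh = X_min[rng.choice(np.argsort(d)[:k])]
        gap = rng.random()                     # position on the segment
        synthetic.append(x + gap * (neigh - x))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])
S = smote_like(X_min, n_new=4)
# The synthetic points fill gaps between real minority instances but
# never extend beyond their domain.
```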
Since fuzzy theory was proposed by Zadeh [17] in 1965, it has been employed to handle uncertain events. For example, to expand crisp observations to fill the information gaps caused by a lack of data, Huang [18] proposed the principle of information diffusion, in which a normal diffusion function developed based on fuzzy theories is defined as

f(x) = (1/(n·h·√(2π))) · Σ_{i=1}^{n} exp[−(x − x_i)²/(2h²)], (1)

where h is the diffusion coefficient and n is the sample size. In 2004, Huang and Moraga [19] proposed the diffusion neural network (DNN), which derives a pair of artificial samples for each observation based on Eq. (1) to fill the information gaps. Since the parameter n in Eq. (1) is a constant once a dataset is given, Huang and Moraga [19] applied the kernel exp[−(x − x_i)²/(2h²)] to derive the virtual values of an observation (x, y) in a two-dimensional dataset as

x ± h_x·√(−2·ln ψ(r)) and y ± h_y·√(−2·ln ψ(r)), (2)

where h_x and h_y are the diffusion coefficients, which are inductions from a large number of simulation results in the DNN; r is the correlation coefficient of the input X and the output Y; and ψ(r) is the transforming function, defined as

ψ(r) = 1 − 10^(−m), (3)

which maps r into a possibility value, where m is determined by r. For example, if r is 0.93 (or 0.96), then m is 3 (or 6) and the possibility ψ(r) is 0.999 (or 0.999999).
Since the DNN requires r > 0.9, its applicability is limited in most practical cases. In addition, the distributions constructed by the DNN still contain information gaps because the DNN only considers the behavior of each individual observation rather than the behavior of the entire dataset.
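The DNN derivation above amounts to solving exp[−(x′ − x)²/(2h²)] = ψ(r) for x′, which yields a symmetric pair of virtual values around each observation. A minimal sketch, assuming that form and example parameter values (h and ψ(r) would come from the original DNN procedure):

```python
import math

def dnn_virtual_pair(x, h, psi):
    """Solve exp(-(x' - x)**2 / (2*h**2)) = psi for x', giving the
    symmetric pair x -/+ h*sqrt(-2*ln(psi)). A sketch of the DNN-style
    derivation; h and psi must come from the original procedure."""
    delta = h * math.sqrt(-2.0 * math.log(psi))
    return x - delta, x + delta

# With psi = 0.999 (the possibility for r = 0.93 in the text), the
# virtual values sit very close to the observation; a smaller psi
# widens the pair.
lo, hi = dnn_virtual_pair(x=5.0, h=0.5, psi=0.999)
print(lo, hi)
```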
To improve the robustness and/or accuracy of the forecasting models produced by data preprocessing when learning with small data, this study proposes a systematical procedure to create new training sets by rebuilding possible sample distributions. The procedure contains a set of new functions that estimate the possible ranges of observations and a sample generation method that considers the relations among attributes, where the functions and the method are developed based on the fuzzy normal function [18] and fuzzy concepts, respectively.
In the experiments, two learning algorithms—a back-propagation neural network (BPN) and support vector regression (SVR)—are adopted to build models. Two VSG approaches—bagging (using BP) and the SMOTE—are employed to compare the effectiveness of the models. Two real cases of a leading company in the thin film transistor liquid crystal display (TFT-LCD) industry in Taiwan are examined. The results indicate that the proposed method is more effective than bagging and the SMOTE for learning from the two cases, with strong statistical support.
The remainder of this study is organized as follows. Section 2 introduces the proposed procedure, and Section 3 describes the experimental environment and the background of the two real cases. Section 4 discusses the experimental results, and Section 5 presents the conclusions of this paper.
Proposed methodology
Two steps can be employed to enhance the data structures of small data by rebuilding the possible sample distributions: estimating the sample distributions and creating new samples. Before introducing the proposed method, the notations in this work are defined.
Experimental environment
This section presents the designs of experiments and a description of two real cases of a TFT-LCD maker in Taiwan.
Experimental results and discoveries
In the sensitivity analysis, the training size n in Case I is 5, 10, and 15; in Case II, the training size n is 5, 10, 15, and 20. The experimental results are summarized in Fig. 13 and Fig. 14, where “SDS,” “bagging,” “SMOTE,” and “PM” denote the control, experiment 1 group, experiment 2 group and experiment 3 group, respectively; the symbol “-” in “PM” indicates no significant difference between “PM” and “SDS;” the rectangles with solid lines and dotted lines indicate no significant
Conclusions
While issues surrounding big data learning have attracted a substantial amount of attention in recent years, learning from small data is an older problem. In certain situations, data analyzers have to learn with small amounts of data, such as the two cases described in this paper. Numerous virtual sample generation approaches, which are data preprocessing methods in KDD, have been developed to extract knowledge from these data. Although BP is extensively applied to create new training sets, the
References (24)
- et al., A case-based reasoning system for aiding detection and classification of nosocomial infections, Decis. Support. Syst. (2016)
- et al., A new approach to prediction of radiotherapy of bladder cancer cells in small dataset analysis, Expert Syst. Appl. (2011)
- et al., Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining, Decis. Support. Syst. (2007)
- et al., Generating information for small data sets with a multi-modal distribution, Decis. Support. Syst. (2014)
- et al., Predicting heart transplantation outcomes through data analytics, Decis. Support. Syst. (2017)
- Zadeh, Fuzzy sets, Inf. Control. (1965)
- Huang and Moraga, A diffusion-neural-network for learning from small samples, Int. J. Approx. Reason. (2004)
- Student, The probable error of a mean, Biometrika (1908)
- et al., Prediction of the period of psychotic episode in individual schizophrenics by simulation-data construction approach, J. Med. Syst. (2010)
- et al., Incorporating prior information in machine learning by creating virtual examples, Proc. IEEE (1998)
- Learning from examples in the small sample case: face expression recognition, IEEE Trans. Syst. Man Cybern. B Cybern.
- Employing virtual samples to build early high-dimensional manufacturing models, Int. J. Prod. Res.
Der-Chiang Li is a Distinguished Professor at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. He received his PhD degree at the Department of Industrial Engineering at Lamar University, Beaumont, Texas, USA, in 1985. As a research professor, his current interest concentrates on machine learning with small data sets. His articles have appeared in Decision Support Systems, Omega, Information Sciences, European Journal of Operational Research, Computers & Operations Research, International Journal of Production Research, and other publications.
Wu-Kuo Lin is a Ph.D. candidate at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. His current research interests are in the area of forecasting and data mining with small data sets. His articles have appeared in Expert Systems with Applications and The Journal of Grey System.
Chien-Chih Chen is a postdoctoral fellow at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. His articles have appeared in Omega, Expert Systems with Applications, International Journal of Production Research, Neurocomputing, Computers & Industrial Engineering, Journal of Intelligent Manufacturing, and other publications.
Hung-Yu Chen is a Ph.D. candidate at the Institute of Information Management, the National Cheng Kung University, Taiwan. His current research interests focus on the learning issue of small datasets.
Liang-Sian Lin is a Ph.D. researcher at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. He is also working at the laboratory for small sample learning. His current interests concentrate on small data sets. His articles have appeared in European Journal of Operational Research, Decision Support Systems, and International Journal of Production Research.