Rebuilding sample distributions for small dataset learning
Introduction
Over the past few decades, numerous machine learning algorithms have been developed to extract knowledge from data [1]. However, the majority of these algorithms were developed under the assumption that a training set can represent the properties of its population. If the training data instead contain insufficient information about the population, the learning algorithms may produce imprecise results for future events.
Although issues related to big-data learning have only attracted attention in recent years, issues related to small-data learning were revealed by Student's t-distribution [2] in 1908. The collection of additional samples to enlarge a sample size and ensure that algorithms can perform sufficient learning is sometimes difficult and/or expensive in certain situations, such as the diagnoses of rare diseases [3], [4], examination of deoxyribonucleic acid (DNA) microarrays [5], pattern recognition with limited pixels [6], [7], development of new products [8], and systems in their initial stages [9]. How to effectively learn robust and accurate information from small data is therefore an issue worthy of additional research.
To demonstrate how small data affect the learning results of most algorithms, Fig. 1 displays two possible distributions of two small datasets with regard to their populations. In Fig. 1(a), the instances are evenly distributed across the population. Although most learning approaches can extract accurate knowledge from such a sample, only a small amount of information will be obtained. Conversely, in Fig. 1(b), the instances are concentrated in one part of the population. The majority of learning approaches will produce biased outcomes from such a sample regardless of the data size.
In addition to the sample distribution, another issue that can cause insufficient information to be obtained is the gaps between two observations in small data. As shown in Fig. 2, although the observations are evenly distributed in the population, gaps exist between two observations in a small dataset. These gaps (referred to as information gaps) should be filled with observations in a complete dataset; however, these observations are not available. Most learning algorithms fail to train their patterns with the unavailable instances in the information gaps in small datasets, and therefore, the obtained information is inadequate. For example, most tree-based algorithms, such as the C4.5 decision tree [10], need to partition continuous data into discrete intervals before evaluating the classification purity. However, the expected size of an interval is usually unavailable in small datasets since some intervals that contain no observations are integrated with their nearest intervals. If an insufficient number of candidate positions exist for the purity evaluation, then the trees that are built and the resulting hierarchy of the classification rules will be small.
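The effect described above for tree-based learners can be made concrete. The following sketch (a hypothetical helper, not part of any cited algorithm) counts the candidate split positions a tree induction method can evaluate, namely the midpoints between consecutive distinct values of a continuous attribute; a small sample offers very few such positions.

```python
import numpy as np

def candidate_splits(values):
    """Midpoints between consecutive distinct sorted values --
    the positions a tree-based learner can evaluate for purity."""
    v = np.unique(values)
    return (v[:-1] + v[1:]) / 2.0

rng = np.random.default_rng(0)
small = rng.uniform(0, 10, size=5)    # a small dataset
large = rng.uniform(0, 10, size=500)  # a larger sample from the same range

# The small sample yields at most 4 candidate positions, so the
# resulting tree and its hierarchy of classification rules stay small.
print(len(candidate_splits(small)))
print(len(candidate_splits(large)))
```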
Virtual sample generation (VSG) methods can be employed to address the learning problem of small data. These methods are a type of data-preprocessing method applied in the process of knowledge discovery in databases (KDD) [11]; research has demonstrated their effectiveness [12]. One of the most extensively applied VSG methods is the bootstrapping procedure (BP) [13], which creates new training sets (referred to as bootstrapping sets) by resampling instances from the original data with a certain probability. The benefit of this approach is that most learning algorithms train on some samples more than once, gradually revising the identified patterns so that they represent the behaviors of the actual data. To overcome the over-fitting issue in training sets, numerous ensemble learning methods have been developed, such as bagging [14] and random forests [15], which employ BP to create bootstrapping sets for algorithms to build classifiers and determine classes by voting. Currently, bagging and random forests are extensively applied to extract knowledge from big data since each bootstrapping set can denote one evenly distributed part of a population.
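The bootstrapping procedure described above can be sketched in a few lines. This is a minimal illustration (function name and seed are ours), resampling with replacement so that each bootstrapping set keeps the original size while duplicating some observations and omitting others.

```python
import numpy as np

def bootstrap_sets(data, n_sets, rng=None):
    """Create bootstrapping sets by resampling the original data
    with replacement; each set has the same size as the original."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    return [data[rng.integers(0, n, size=n)] for _ in range(n_sets)]

data = np.array([1.2, 3.4, 5.6, 7.8, 9.0])
sets = bootstrap_sets(data, n_sets=3)
for s in sets:
    # Each set contains only values from the original data,
    # typically with repetitions and omissions.
    print(sorted(s))
```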
When applying bagging or random forests to learn with small data, the use of bootstrapping sets may create two issues: an unstable data structure and overfitting, as shown in Fig. 3(a) and Fig. 3(b), respectively. A comparison of Fig. 2 with Fig. 3(a) reveals that certain observations in Fig. 2 are missing in Fig. 3(a) because they were not selected with a certain probability when forming the bootstrapping sets. The number of observations is very small, and thus, the difference between the features of the two bootstrapping sets in Fig. 3(a) is large. Since the amount of information provided by small data is minimal, any missing observations in the bootstrapping sets can increase the loss of information. Although we can double the observations to form the bootstrapping sets, as shown in Fig. 3(b), this step usually causes the patterns identified by the algorithms to represent the behaviors of a few observations, which causes overfitting. The amount of information provided by the set in Fig. 3(b) does not increase because the increased information is the same information provided by the same observations.
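The chance of an observation going missing from a bootstrapping set, as in Fig. 3(a), follows directly from the resampling scheme: a particular observation is never drawn with probability (1 − 1/n)^n, which approaches 1/e (about 36.8%) as n grows. For small n the absolute information loss from each such omission is severe, as this short calculation illustrates.

```python
import math

# Probability that a given observation is absent from a bootstrapping
# set of the same size n, i.e. it is missed in all n draws.
for n in (5, 10, 100):
    p_miss = (1 - 1 / n) ** n
    print(n, round(p_miss, 3))

# Limiting value as n grows: 1/e, approximately 0.368.
print(round(math.exp(-1), 3))
```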
The synthetic minority over-sampling technique (SMOTE) [16] was proposed to generate artificial samples that differ from the original samples in the minority class. Based on the k-nearest neighbors, the SMOTE generates synthetic data along continuous vectors between the minority class's instances and their nearest neighbors, as shown in Fig. 4. Although the information gaps in the minority class are filled with synthetic instances, they are distributed within the domain of the real instances in the minority class. The method employed by the SMOTE to generate samples is simple and does not consider the possible distributions of the entire minority class.
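The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration of the idea, not the reference implementation: each synthetic point is placed at a random position on the segment between a minority instance and one of its k nearest minority neighbors, so all synthetic points stay within the domain of the real minority instances.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, rng=None):
    """Generate synthetic minority samples along the segments between
    a minority instance and one of its k nearest minority neighbors
    (a simplified sketch of the SMOTE idea)."""
    rng = rng or np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)  # distances to the others
        d[i] = np.inf                          # exclude the point itself
        neigh = X_min[rng.choice(np.argsort(d)[:k])]
        gap = rng.random()                     # position on the segment
        synthetic.append(x + gap * (neigh - x))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])
S = smote_like(X_min, n_new=4)
# The synthetic points fill gaps between real minority instances but
# never extend beyond their domain.
```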
Since fuzzy theory was proposed by Zadeh [17] in 1965, it has been employed to handle uncertain events. For example, to expand crisp observations to fill the information gaps caused by a lack of data, Huang [18] proposed the principle of information diffusion, in which a normal diffusion function developed based on fuzzy theories is defined as

f(x) = (1/(n·h·√(2π))) · Σ_{i=1}^{n} exp[−(x − x_i)²/(2h²)], (1)

where h is the diffusion coefficient and n is the sample size. In 2004, Huang and Moraga [19] proposed the diffusion neural network (DNN), which derives a pair of artificial samples for each observation based on Eq. (1) to fill the information gaps. Since the parameter n in Eq. (1) is a constant once a dataset is given, Huang and Moraga [19] applied the kernel exp[−(x − x_i)²/(2h²)] to derive the virtual values of an observation (x, y) in a two-dimensional dataset as

x ± h_x·√(−2·ln ψ(r)) and y ± h_y·√(−2·ln ψ(r)), (2)

where h_x and h_y are the diffusion coefficients, which are inductions from a large number of simulation results in the DNN; r is the correlation coefficient of the input X and the output Y; and ψ(r) is the transforming function, defined as

ψ(r) = 1 − 10^(−m), (3)

which maps r into a possibility value, where m is determined by r. For example, if r is 0.93 (or 0.96), then m is 3 (or 6) and the possibility ψ(r) is 0.999 (or 0.999999).
Since the DNN requires r > 0.9, its applicability is limited in most practical cases. In addition, the distributions constructed by the DNN still contain information gaps because the DNN only considers the behavior of each individual observation rather than the behavior of the entire dataset.
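The DNN derivation above amounts to solving exp[−(x′ − x)²/(2h²)] = ψ(r) for x′, which yields a symmetric pair of virtual values around each observation. A minimal sketch, assuming that form and example parameter values (h and ψ(r) would come from the original DNN procedure):

```python
import math

def dnn_virtual_pair(x, h, psi):
    """Solve exp(-(x' - x)**2 / (2*h**2)) = psi for x', giving the
    symmetric pair x -/+ h*sqrt(-2*ln(psi)). A sketch of the DNN-style
    derivation; h and psi must come from the original procedure."""
    delta = h * math.sqrt(-2.0 * math.log(psi))
    return x - delta, x + delta

# With psi = 0.999 (the possibility for r = 0.93 in the text), the
# virtual values sit very close to the observation; a smaller psi
# widens the pair.
lo, hi = dnn_virtual_pair(x=5.0, h=0.5, psi=0.999)
print(lo, hi)
```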
To improve the robustness and/or accuracy of the forecasting models produced by data preprocessing when learning with small data, this study proposes a systematical procedure to create new training sets by rebuilding possible sample distributions. The procedure contains a set of new functions that estimate the possible ranges of observations and a sample generation method that considers the relations among attributes, where the functions and the method are developed based on the fuzzy normal function [18] and fuzzy concepts, respectively.
In the experiments, two learning algorithms—a back-propagation neural network (BPN) and support vector regression (SVR)—are adopted to build models. Two VSG approaches—bagging (using BP) and the SMOTE—are employed to compare the effectiveness of the models. Two real cases of a leading company in the thin film transistor liquid crystal display (TFT-LCD) industry in Taiwan are examined. The results indicate that the proposed method is more effective than bagging and the SMOTE for learning from the two cases, with strong statistical support.
The remainder of this study is organized as follows. Section 2 introduces the proposed procedure, and Section 3 describes the experimental environment and the background of the two real cases. Section 4 discusses the experimental results, and Section 5 presents the conclusions of this paper.
Proposed methodology
Two steps can be employed to enhance the data structures of small data by rebuilding the possible sample distributions: estimating the sample distributions and creating new samples. Before introducing the proposed method, the notations in this work are defined.
Experimental environment
This section presents the designs of experiments and a description of two real cases of a TFT-LCD maker in Taiwan.
Experimental results and discoveries
In the sensitivity analysis, the training size n in Case I is 5, 10, and 15; in Case II, the training size n is 5, 10, 15, and 20. The experimental results are summarized in Fig. 13 and Fig. 14, where “SDS,” “bagging,” “SMOTE,” and “PM” denote the control, experiment 1 group, experiment 2 group and experiment 3 group, respectively; the symbol “-” in “PM” indicates no significant difference between “PM” and “SDS;” the rectangles with solid lines and dotted lines indicate no significant
Conclusions
While issues surrounding big data learning have attracted a substantial amount of attention in recent years, learning from small data is an older problem. In certain situations, data analyzers have to learn with small amounts of data, such as the two cases described in this paper. Numerous virtual sample generation approaches, which are data preprocessing methods in KDD, have been developed to extract knowledge from these data. Although BP is extensively applied to create new training sets, the
References (24)
- et al., A case-based reasoning system for aiding detection and classification of nosocomial infections, Decis. Support. Syst. (2016)
- et al., A new approach to prediction of radiotherapy of bladder cancer cells in small dataset analysis, Expert Syst. Appl. (2011)
- et al., Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining, Decis. Support. Syst. (2007)
- et al., Generating information for small data sets with a multi-modal distribution, Decis. Support. Syst. (2014)
- et al., Predicting heart transplantation outcomes through data analytics, Decis. Support. Syst. (2017)
- Zadeh, Fuzzy sets, Inf. Control. (1965)
- Huang and Moraga, A diffusion-neural-network for learning from small samples, Int. J. Approx. Reason. (2004)
- Student, The probable error of a mean, Biometrika (1908)
- et al., Prediction of the period of psychotic episode in individual schizophrenics by simulation-data construction approach, J. Med. Syst. (2010)
- et al., Incorporating prior information in machine learning by creating virtual examples, Proc. IEEE (1998)
- Learning from examples in the small sample case: face expression recognition, IEEE Trans. Syst. Man Cybern. B Cybern.
- Employing virtual samples to build early high-dimensional manufacturing models, Int. J. Prod. Res.
Der-Chiang Li is a Distinguished Professor at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. He received his PhD degree at the Department of Industrial Engineering at Lamar University, Beaumont, Texas, USA, in 1985. As a research professor, his current interest concentrates on machine learning with small data sets. His articles have appeared in Decision Support Systems, Omega, Information Sciences, European Journal of Operational Research, Computers & Operations Research, International Journal of Production Research, and other publications.
Wu-Kuo Lin is a Ph.D. candidate at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. His current research interests are in the area of forecasting and data mining with small data sets. His articles have appeared in Expert Systems with Applications and The Journal of Grey System.
Chien-Chih Chen is a postdoctoral fellow at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. His articles have appeared in Omega, Expert Systems with Applications, International Journal of Production Research, Neurocomputing, Computers & Industrial Engineering, Journal of Intelligent Manufacturing, and other publications.
Hung-Yu Chen is a Ph.D. candidate at the Institute of Information Management, the National Cheng Kung University, Taiwan. His current research interests focus on the learning issue of small datasets.
Liang-Sian Lin is a Ph.D. researcher at the Department of Industrial and Information Management, the National Cheng Kung University, Taiwan. He is also working at the laboratory for small sample learning. His current interests concentrate on small data sets. His articles have appeared in European Journal of Operational Research, Decision Support Systems, and International Journal of Production Research.