Using virtual samples to improve learning performance for small datasets with multimodal distributions

Li, Der-Chiang; Lin, Liang-Sian; Chen, Chien-Chih; Yu, Wei-Hao

doi:10.1007/s00500-018-03744-z

Methodologies and Application
Published: 08 January 2019

Soft Computing, Volume 23, pages 11883–11900 (2019)

Der-Chiang Li¹,
Liang-Sian Lin²,
Chien-Chih Chen¹ &
…
Wei-Hao Yu¹


Abstract

A small dataset, conventionally one containing no more than thirty samples as in traditional normal-distribution statistics, often makes it difficult for learning algorithms to make precise predictions. Many virtual sample generation (VSG) approaches have been shown to overcome this issue by adding virtual samples to training sets. Some of these methods create samples from estimated sample distributions but treat those distributions as unimodal, without considering that small data may actually follow multimodal distributions. Accordingly, before estimating sample distributions, this paper employs density-based spatial clustering of applications with noise (DBSCAN) to cluster small data and applies the AICc (the corrected version of the Akaike information criterion for small datasets) to assess the clustering results as an essential data pre-processing step. If the AICc indicates that the clusters appropriately represent the dispersion of the small dataset, the sample distribution of each cluster is estimated with the maximal p value (MPV) method, so that together they represent a multimodal distribution; otherwise, all of the data are treated as following a unimodal distribution. We call the proposed method multimodal MPV (MMPV). Based on the estimated distributions, virtual samples are created, with a mechanism for evaluating suitable sample sizes. In the experiments, one real and two public datasets are examined, and the bagging (bootstrap aggregating) procedure is employed to build the models, which are support vector regressions with three kernel functions: linear, polynomial, and radial basis. Paired t tests show that, in most of the statistical results, the forecasting accuracy of MMPV is significantly better than that of MPV, of a VSG method based on fuzzy C-means, and of REAL (training on the original sets only).
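The AICc check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the likelihood model, so a Gaussian fit stands in for the MPV estimate, the cluster labels are assumed to come from a prior clustering step such as DBSCAN (with no noise points), and the parameter count per model is an assumption of this sketch. All function names are hypothetical. Only the AICc formula itself, AICc = 2k − 2 ln L + 2k(k + 1)/(n − k − 1), is standard.

```python
import math


def aicc(log_lik, k, n):
    """Corrected AIC for small samples: 2k - 2 ln L + 2k(k+1)/(n-k-1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return 2 * k - 2 * log_lik + (2 * k * (k + 1)) / (n - k - 1)


def fit_gaussian(xs):
    """Maximum-likelihood mean and standard deviation (assumed stand-in for MPV)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    if var == 0:
        raise ValueError("need at least two distinct values per group")
    return mu, math.sqrt(var)


def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def unimodal_aicc(xs):
    """AICc of a single Gaussian fitted to all data (k = 2: mean, std)."""
    mu, sigma = fit_gaussian(xs)
    log_lik = sum(gaussian_logpdf(x, mu, sigma) for x in xs)
    return aicc(log_lik, k=2, n=len(xs))


def clustered_aicc(xs, labels):
    """AICc of one Gaussian per cluster, using hard cluster assignments.

    Assumption: k = 3m - 1 for m clusters (m means, m stds, m - 1 free weights).
    """
    n = len(xs)
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    log_lik = 0.0
    for members in groups.values():
        mu, sigma = fit_gaussian(members)
        weight = len(members) / n
        log_lik += sum(math.log(weight) + gaussian_logpdf(x, mu, sigma) for x in members)
    return aicc(log_lik, k=3 * len(groups) - 1, n=n)


def prefer_multimodal(xs, labels):
    """True when the clustered model has the lower (better) AICc."""
    return clustered_aicc(xs, labels) < unimodal_aicc(xs)


# Illustrative data only: two well-separated groups of six points each.
data = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 5.0, 5.2, 4.9, 5.1, 5.3, 4.8]
cluster_labels = [0] * 6 + [1] * 6
print(prefer_multimodal(data, cluster_labels))
```

With very few samples the correction term grows quickly as clusters are added, so a clustering with many small groups is penalised heavily or becomes undefined; this is why the sketch raises an error when n ≤ k + 1 rather than returning a value.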

Author information

Authors and Affiliations

Department of Industrial and Information Management, National Cheng Kung University, University Road, Tainan, 70101, Taiwan, ROC
Der-Chiang Li, Chien-Chih Chen & Wei-Hao Yu
Information and Communications Research Laboratories, Industrial Technology Research Institute, Chung Hsing Road, Chutung, Hsinchu, 31040, Taiwan, ROC
Liang-Sian Lin

Corresponding author

Correspondence to Liang-Sian Lin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, DC., Lin, LS., Chen, CC. et al. Using virtual samples to improve learning performance for small datasets with multimodal distributions. Soft Comput 23, 11883–11900 (2019). https://doi.org/10.1007/s00500-018-03744-z

Published: 08 January 2019
Issue Date: November 2019
DOI: https://doi.org/10.1007/s00500-018-03744-z
