Abstract
Small sample size (SSS) problems pose a tremendous challenge in modeling tasks because of insufficient training samples, especially in the process industry, where thousands of useless samples overwhelm the very limited valuable samples and degrade the ability of trained models to predict key variables. In this study, the prediction ability of forecasting models is enhanced by generating virtual samples. Considering the integrated effects of attributes, a new data augmentation approach called ITNN-VSG, which integrates virtual sample generation (VSG) with an input-training neural network (ITNN), was put forward to enlarge training datasets and thereby improve the performance of forecasting models. In the absence of any available domain-specific knowledge about the target models, a query-driven interpolation process was first developed to explore the overall tendency of the data distribution in both sparse and dense regions. Second, an ITNN with fixed weights was used to calculate the input corresponding to each virtual output generated by the interpolation process. To validate the effectiveness of the proposed approach, several in silico experiments were carried out on a benchmark dataset from the sinc(x) function, followed by a real-world application to a purified terephthalic acid (PTA) solvent system. The experimental results demonstrated that the proposed approach outperformed existing approaches such as mega-trend-diffusion and tree-based-trend-diffusion.
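The two steps described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a scalar input, a tiny tanh network whose frozen weights are hand-picked (hypothetical values standing in for a network already trained on the real samples), midpoint interpolation between neighboring outputs as a simple stand-in for the query-driven interpolation process, and plain gradient descent on the input as the input-training step. All names (`forward`, `virtual_outputs`, `train_input`) are illustrative.

```python
import numpy as np

# Frozen weights of a tiny 1-3-1 tanh network. These are hypothetical values
# standing in for a network that was already trained on the real small dataset.
W1 = np.array([1.0, 0.5, 2.0])
B1 = np.array([0.0, -0.5, 0.3])
W2 = np.array([0.8, 1.2, 0.5])
B2 = 0.1


def forward(x):
    """Network output for a scalar input x."""
    return float(W2 @ np.tanh(W1 * x + B1) + B2)


def d_forward(x):
    """Derivative of the network output with respect to the input x."""
    h = np.tanh(W1 * x + B1)
    return float(W2 @ (W1 * (1.0 - h ** 2)))


def virtual_outputs(y):
    """Midpoints between consecutive sorted outputs (assumed interpolation rule)."""
    y_sorted = np.sort(y)
    return 0.5 * (y_sorted[:-1] + y_sorted[1:])


def train_input(y_target, x_init, lr=0.1, steps=500):
    """Input training: gradient descent on the input only, weights stay fixed."""
    x = x_init
    for _ in range(steps):
        err = forward(x) - y_target
        x -= lr * 2.0 * err * d_forward(x)  # gradient of (f(x) - y)^2 w.r.t. x
    return x


# A handful of "real" samples; the outputs come from the frozen network itself.
x_real = np.array([-1.0, -0.2, 0.4, 1.0])
y_real = np.array([forward(x) for x in x_real])

# Step 1: generate virtual outputs by interpolation.
y_virtual = virtual_outputs(y_real)

# Step 2: recover the matching virtual inputs through the frozen network,
# starting each search from the midpoint of the neighboring real inputs.
x_init = 0.5 * (x_real[:-1] + x_real[1:])
x_virtual = np.array([train_input(y, x0) for y, x0 in zip(y_virtual, x_init)])

for xv, yv in zip(x_virtual, y_virtual):
    print(f"virtual sample: x = {xv:.4f}, y = {yv:.4f}")
```

The positive weights make this toy network monotonic, so each virtual output has exactly one matching input; with a non-monotonic target such as sinc(x), the recovered input would depend on the starting point.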
Acknowledgements
Many thanks to Andy Koswara, Botond Szilagyi, and Kanjakha Pal at the Davidson School of Chemical Engineering, Purdue University, for invaluable discussions and advice. This research was partly funded by the National Natural Science Foundation of China (Grant Nos. 61973024, 61973022, and 61703027), the China Scholarship Council State-Sponsored Scholarship Program (Grant Nos. 201806880024 and 201806885004), the Fundamental Research Funds for the Central Universities (Grant No. JD1808), and the Open Research Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University (Grant No. 18I01).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
No individual participants are included in the study.
Cite this article
Chen, ZS., Zhu, QX., Xu, Y. et al. Integrating virtual sample generation with input-training neural network for solving small sample size problems: application to purified terephthalic acid solvent system. Soft Comput 25, 6489–6504 (2021). https://doi.org/10.1007/s00500-021-05641-4
DOI: https://doi.org/10.1007/s00500-021-05641-4