SPXYE: an improved method for partitioning training and validation sets


Abstract

This study proposes a sample selection strategy termed SPXYE (sample set partitioning based on joint X–Y–E distances) for partitioning data in multivariate modeling, where training and validation sets are required. The method selects the training set according to the X (independent variables), Y (dependent variables) and E (errors between the dependent variables and preliminarily calculated results) spaces, and provides a valuable tool for multivariate calibration. SPXYE was applied to three widely used molecular databases to obtain training and validation sets for partial least squares (PLS) modeling. For comparison, training and validation sets were also generated using random sampling, the Kennard–Stone method and sample set partitioning based on joint X–Y distances (SPXY). The predictions of all associated PLS regression models were evaluated on the same testing set, which was distinct from both the training and validation sets. The results indicate that the proposed SPXYE strategy may serve as an alternative partitioning strategy.
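For concreteness, the joint distance underlying SPXYE can be written, by analogy with the SPXY joint X–Y distance, as the sum of the max-normalized pairwise distances in the three spaces. The following display is a hedged reconstruction based on the abstract's description, not a formula quoted from the paper:

\[
d_{xye}(p,q) \;=\; \frac{d_{x}(p,q)}{\max_{p,q} d_{x}(p,q)} \;+\; \frac{d_{y}(p,q)}{\max_{p,q} d_{y}(p,q)} \;+\; \frac{d_{e}(p,q)}{\max_{p,q} d_{e}(p,q)}, \qquad p,q \in \{1,\dots,N\},
\]

where d_x, d_y and d_e are Euclidean distances computed in the X, Y and E spaces, respectively, and samples are then selected sequentially on d_{xye} in Kennard–Stone fashion.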


Acknowledgements

The authors gratefully acknowledge financial support from NSFC (21473025 and 21131001), the Science and Technology Development Planning of Jilin Province (20150204041GX and 20130522109JH), and the Education Projects of Jilin Province (2015552, 2014B045, 2015553 and 2015556).

Author information

Contributions

TG designed the study. LNH and ZZJ performed the computations. TNX, CF and HZL assisted with the study design. TG and LHH drafted the manuscript. YHL and HL supervised the study. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to LiHong Hu or Hui Li.

Appendix: Matlab implementation of the proposed SPXYE algorithm

In this Matlab function, X, y and e are the matrix of independent variables, the matrix of experimental values (dependent variables) and the matrix of errors between the experimental and calculated values, respectively. Ncal is the number of samples to be selected for the training set. The indexes of the selected samples are returned in the vector m.
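The published listing itself is not reproduced here. What follows is a minimal sketch of how such a function could look, assuming SPXYE applies a Kennard–Stone-style selection to a joint distance formed by summing the max-normalized pairwise distances in the X, Y and E spaces (by analogy with SPXY). The function name spxye and its internals are illustrative; only the interface (X, y, e, Ncal, m) follows the description above. Note that pdist and squareform belong to the Statistics and Machine Learning Toolbox.

function m = spxye(X, y, e, Ncal)
% Sketch of an SPXYE-style selection (illustrative, not the published code).
% X    - matrix of independent variables, one sample per row
% y    - matrix/vector of experimental (dependent) values, one sample per row
% e    - matrix/vector of errors between experimental and calculated values
% Ncal - number of samples to select for the training set (assumed >= 2)
% m    - row vector with the indexes of the selected samples

% Pairwise Euclidean distance matrices in the X, Y and E spaces
dx = squareform(pdist(X));
dy = squareform(pdist(y));
de = squareform(pdist(e));

% Joint distance: each term is normalized by its maximum so that the three
% spaces contribute on a comparable scale (analogous to the SPXY distance)
d = dx./max(dx(:)) + dy./max(dy(:)) + de./max(de(:));

N = size(X, 1);
m = zeros(1, Ncal);

% Kennard-Stone-style selection on the joint distances:
% start from the pair of samples that are farthest apart
[~, idx] = max(d(:));
[m(1), m(2)] = ind2sub(size(d), idx);

for i = 3:Ncal
    remaining = setdiff(1:N, m(1:i-1));
    % distance from each remaining sample to its nearest selected sample
    dmin = min(d(remaining, m(1:i-1)), [], 2);
    % pick the remaining sample that is farthest from the selected set
    [~, k] = max(dmin);
    m(i) = remaining(k);
end
end

As a usage example, m = spxye(X, y, e, 60); Xcal = X(m, :); ycal = y(m, :); would extract a 60-sample training set, leaving the remaining samples available for the validation set.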

Cite this article

Gao, T., Hu, L., Jia, Z. et al. SPXYE: an improved method for partitioning training and validation sets. Cluster Comput 22 (Suppl 2), 3069–3078 (2019). https://doi.org/10.1007/s10586-018-1877-9
