SPXYE: an improved method for partitioning training and validation sets

Gao, Ting; Hu, Lina; Jia, Zhizhen; Xia, Tianna; Fang, Chao; Li, Hongzhi; Hu, LiHong; Lu, Yinghua; Li, Hui

doi:10.1007/s10586-018-1877-9

Published: 19 February 2018

Volume 22, pages 3069–3078, (2019)
Cluster Computing

Ting Gao, Lina Hu, Zhizhen Jia, Tianna Xia, Chao Fang, Hongzhi Li, LiHong Hu (ORCID: orcid.org/0000-0003-3792-2917), Yinghua Lu & Hui Li

Abstract

This study proposes a sample selection strategy termed SPXYE (sample set partitioning based on joint X–Y–E distances) for partitioning data in multivariate modeling, where training and validation sets are required. The method chooses the training set according to distances in the X (independent variables), Y (dependent variables), and E (error of the preliminarily calculated results relative to the dependent variables) spaces, and is intended as a tool for multivariate calibration. SPXYE was applied to three household chemical molecular databases to obtain training and validation sets for partial least squares (PLS) modeling. For comparison, training and validation sets were also generated using random sampling, the Kennard–Stone algorithm, and sample set partitioning based on joint X–Y distances (SPXY). All associated PLS regression models were evaluated on the same test set, which was distinct from both the training set and the validation set. The results indicate that SPXYE may serve as an alternative partitioning strategy.
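
For orientation, the joint X–Y distance used by sample set partitioning based on joint X–Y distances is commonly written as the first line below, where d_x(p,q) is the Euclidean distance between the X rows of samples p and q and d_y(p,q) = |y_p - y_q|. The second line is only an assumption about how a joint X–Y–E distance might be formed, inferred from the method's name; the symbol d_e(p,q) = |e_p - e_q| and the equal weighting of the three terms are hypothetical and are not taken from the article's full text.

```latex
\begin{align}
d_{xy}(p,q)  &= \frac{d_x(p,q)}{\max_{p,q} d_x(p,q)} + \frac{d_y(p,q)}{\max_{p,q} d_y(p,q)} \\
d_{xye}(p,q) &= d_{xy}(p,q) + \frac{d_e(p,q)}{\max_{p,q} d_e(p,q)} \qquad \text{(assumed form)}
\end{align}
```

Dividing each term by its maximum over all sample pairs puts the spaces on a common [0, 1] scale, so that no single space dominates the selection.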

Acknowledgements

The authors gratefully acknowledge financial support from NSFC (21473025 and 21131001), the Science and Technology Development Planning of Jilin Province (20150204041GX and 20130522109JH), and the Education Projects of Jilin Province (2015552, 2014B045, 2015553 and 2015556).

Author information

Authors and Affiliations

School of Information Science and Technology, Northeast Normal University, Changchun, 130117, China
Ting Gao, Lina Hu, Zhizhen Jia, Tianna Xia, Chao Fang, Hongzhi Li, LiHong Hu, Yinghua Lu & Hui Li

Contributions

TG designed the study. LNH and ZZJ ran the computations. TNX, CF and HZL helped with the study design. TG and LHH drafted the manuscript. YHL and HL supervised the study. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to LiHong Hu or Hui Li.

Appendix: Matlab implementation of the proposed SPXYE algorithm

In this Matlab function, X, y and e are the parameter matrix (independent variables), the experimental value matrix (dependent variable) and the error matrix (the errors between the experimental values and the calculated values), respectively. Ncal is the number of samples to be selected for the training set. The indices of the selected samples are returned in vector m.
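
The Matlab listing itself is not reproduced on this page. As a rough illustration of the interface described above, the following is a minimal Python sketch, not the authors' code. It assumes that the joint distance is the sum of the max-normalized X, y and e distances and that selection follows the Kennard–Stone max-min rule; the names `spxye_sketch` and `_scaled_distances` are hypothetical, and indices are 0-based, unlike Matlab.

```python
import math


def _scaled_distances(rows):
    """Pairwise Euclidean distances, divided by the largest distance."""
    n = len(rows)
    d = [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    dmax = max(max(r) for r in d) or 1.0  # avoid 0/0 if all rows coincide
    return [[v / dmax for v in r] for r in d]


def spxye_sketch(X, y, e, ncal):
    """Return the indices m of ncal training samples (0-based).

    X is a list of feature rows; y and e are lists of scalars.
    ASSUMPTION: joint distance = normalized d_x + d_y + d_e.
    """
    n = len(X)
    if not 2 <= ncal <= n:
        raise ValueError("ncal must be between 2 and the number of samples")
    dx = _scaled_distances(X)
    dy = _scaled_distances([[v] for v in y])
    de = _scaled_distances([[v] for v in e])
    d = [[dx[i][j] + dy[i][j] + de[i][j] for j in range(n)] for i in range(n)]

    # Seed the training set with the two most distant samples.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: d[p[0]][p[1]])
    m = [i0, j0]
    remaining = [k for k in range(n) if k not in m]
    # Distance from each unselected sample to its nearest selected sample.
    nearest = {k: min(d[k][i0], d[k][j0]) for k in remaining}

    # Max-min rule: repeatedly add the sample farthest from the selected set.
    while len(m) < ncal:
        nxt = max(remaining, key=lambda k: nearest[k])
        m.append(nxt)
        remaining.remove(nxt)
        del nearest[nxt]
        for k in remaining:
            nearest[k] = min(nearest[k], d[k][nxt])
    return m
```

Under these assumptions, a call such as `m = spxye_sketch(X, y, e, Ncal)` returns the training indices, and the samples not listed in `m` can be used as the validation set.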

About this article

Cite this article

Gao, T., Hu, L., Jia, Z. et al. SPXYE: an improved method for partitioning training and validation sets. Cluster Comput 22 (Suppl 2), 3069–3078 (2019). https://doi.org/10.1007/s10586-018-1877-9

Received: 28 November 2017
Revised: 26 December 2017
Accepted: 16 January 2018
Published: 19 February 2018
Issue Date: March 2019
DOI: https://doi.org/10.1007/s10586-018-1877-9
