Elsevier

Neurocomputing

Volume 240, 31 May 2017, Pages 183-190

Optimization enhanced genetic algorithm-support vector regression for the prediction of compound retention indices in gas chromatography

https://doi.org/10.1016/j.neucom.2016.11.070

Abstract

A new method combining a genetic algorithm and support vector regression with parameter optimization (GA–SVR–PO) was developed for the prediction of compound retention indices (RI) in gas chromatography. The dataset used in this work consists of 252 compounds extracted from the Molecular Operating Environment (MOE) boiling point database. Molecular descriptors were calculated with the descriptor tools of the MOE software package. After removing redundant descriptors, 151 descriptors were obtained for each compound. A genetic algorithm (GA) was used to select the best subset of molecular descriptors and the best SVR parameters so as to optimize the prediction performance for compound retention indices. A 10-fold cross-validation method was used to evaluate the prediction performance. We compared the performance of our proposed model with three existing methods: GA coupled with multiple linear regression (GA–MLR), the subset selected by GA–MLR used to train SVR (GA–MLR–SVR), and GA on SVR (GA–SVR). The experimental results demonstrate that our proposed GA–SVR–PO model has better predictive performance than the other existing models, with R² > 0.967 and RMSE = 49.94. The prediction accuracy of the GA–SVR–PO model is 96% at 10% prediction variation.
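The three figures of merit quoted in the abstract can be computed as in the following minimal Python sketch. It assumes that R² is the coefficient of determination and that "accuracy at 10% prediction variation" means the fraction of compounds whose predicted RI lies within 10% of the experimental value. The abstract defines neither, so both readings are assumptions, and the function names are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between experimental and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed meaning of R² here)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_within(y_true, y_pred, tolerance=0.10):
    """Fraction of predictions within `tolerance` (relative) of the true value.

    This reading of "prediction variation" is an assumption, not the
    paper's stated definition.
    """
    hits = sum(abs(p - t) <= tolerance * abs(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

Under 10-fold cross-validation, these functions would typically be applied to the pooled held-out predictions or averaged across folds; the excerpt does not say which.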

Introduction

Gas chromatography coupled with mass spectrometry (GC–MS) is a powerful analytical platform for the identification and quantification of small molecules in chemistry and biomedical research. A GC–MS system measures the retention time and mass spectrum of each molecule. Currently, the National Institute of Standards and Technology (NIST) MS database (NIST/EPA/NIH Mass Spectral Library) is widely used for molecular identification with the Automated Mass Spectral Deconvolution and Identification System (AMDIS). AMDIS identifies molecules based on the similarity between the experimental mass spectrum and the mass spectra recorded in the NIST MS library [1].

Retention time is a measure of the interactions between a molecule and the stationary phase of the GC column. Therefore, the retention time of a molecule in GC is correlated with its molecular structure. Uncovering this inherent relationship between retention time and molecular structure will significantly benefit not only the understanding of gas phase chemistry, but also molecular identification in metabolomics and other research fields. This is often done by converting the retention time into the retention index (RI). The RI of a molecule is its retention time normalized to the retention times of adjacently eluting n-alkanes, which can be obtained from either an internal or an external calibration experiment. While retention times vary with the individual chromatographic system, the derived retention indices are largely independent of chromatographic parameters and allow values measured by different analytical laboratories under varying conditions to be compared. The Kovats RI is used for isothermal experiments [2] and the linear RI is designed for temperature gradient experiments [3]. However, the experimental RI data currently available are very limited compared to the mass spectral data recorded in the NIST MS library. Only 21,940 molecules have RI information, even though the NIST MS library contains mass spectra for 192,108 molecules. In order to employ RI as a match factor for metabolite identification, it is necessary to theoretically predict the RI values of molecules that do not have experimental RI information.
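To make the normalization against adjacent n-alkanes concrete, the sketch below computes a retention index from a set of alkane retention times. It uses the standard textbook forms of the two indices: the logarithmic (Kovats) form for isothermal runs, which expects adjusted retention times, and the linear form for temperature-programmed runs. The function name and the retention times in the usage note are hypothetical and do not come from the paper.

```python
import math


def retention_index(t_x, alkane_times, isothermal=False):
    """Retention index of an analyte bracketed by two adjacent n-alkanes.

    t_x          -- analyte retention time (adjusted retention time if
                    isothermal=True)
    alkane_times -- dict mapping carbon number n to the retention time of
                    that n-alkane, in the same units as t_x
    isothermal   -- True for the logarithmic (Kovats) form, False for the
                    linear form used with temperature gradients
    """
    carbons = sorted(alkane_times)
    for n, n_next in zip(carbons, carbons[1:]):
        t_n, t_next = alkane_times[n], alkane_times[n_next]
        # Only use alkanes that differ by exactly one carbon atom.
        if n_next == n + 1 and t_n <= t_x <= t_next:
            if isothermal:
                frac = (math.log(t_x) - math.log(t_n)) / (
                    math.log(t_next) - math.log(t_n)
                )
            else:
                frac = (t_x - t_n) / (t_next - t_n)
            return 100.0 * (n + frac)
    raise ValueError("analyte is not bracketed by adjacent n-alkanes")
```

For example, with hypothetical alkane times `{8: 5.0, 9: 7.0, 10: 9.0}` minutes, an analyte eluting at 6.0 minutes in a temperature-programmed run sits halfway between octane and nonane and receives a linear RI of 850.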

Quantitative structure–retention relationship (QSRR) models have been used to estimate molecular RI values from molecular descriptors generated from the chemical structure [4], [5], [6]. The success of a QSRR model depends on the accuracy of the input RI data, the selection of appropriate molecular descriptors, and the statistical tools used for retention index prediction. Most QSRR studies focus on the selection of suitable statistical tools. The methods developed for creating a QSRR model include multiple linear regression (MLR) [7], [8], partial least squares (PLS) [9], [10], artificial neural network (ANN) [11], [12], [13], [14], radial basis function (RBF) neural network [15], random forest (RF) [16], and support vector regression (SVR) [17], [18].

Little work has been done to investigate how the method used to select molecular descriptors affects the performance of retention index prediction. Hancock et al. compared the prediction performance of multiple data mining techniques and found that GA [19] plus MLR achieved better performance than the others [20]. The optimal descriptors selected by GA–MLR have been employed to train SVR for retention index prediction (GA–MLR–SVR) [21]. However, the GA–MLR method only selects molecular descriptors that are linearly correlated with the retention indices. Molecular descriptors that have a non-linear relationship with the retention index are excluded. On the other hand, the use of SVR requires users to tune the SVR parameters. Determining the optimal SVR parameters usually relies on a time-consuming method such as grid search [22]. To address this problem, Ustun et al. used GA and a simplex optimization to determine the optimal SVR parameters [23], but they did not use the optimization algorithm to select an optimal subset of molecular descriptors. Lin et al. used the simulated annealing algorithm to select the optimal features and the parameters of a support vector machine (SVM) for classification problems [24]. However, that work was not developed for regression problems. To our knowledge, GA has not yet been used to search for the optimal SVR parameters and the optimal subset of molecular descriptors simultaneously for RI prediction.

To develop a QSRR model that predicts molecular retention indices in gas chromatography more accurately, we present an algorithm combining a genetic algorithm and support vector regression with parameter optimization (GA–SVR–PO). The dataset used in this work was extracted from the Molecular Operating Environment (MOE) boiling point database [25], and the true RI values of the molecules were extracted from the NIST RI08 library. We then analyzed the prediction performance of the proposed GA–SVR–PO method and compared it with three other existing methods: GA–MLR, GA–MLR–SVR, and GA–SVR. The experimental results confirm the effectiveness of our proposed approach.
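The idea of searching the descriptor subset and the SVR parameters at the same time can be sketched as a single chromosome that carries both. The Python sketch below is only an illustration under stated assumptions: the encoding (a boolean mask plus three real genes mapped to log-scaled C, gamma and epsilon), the search bounds, the tournament selection, the uniform crossover, the mutation scheme and the elitism are all hypothetical choices, since this excerpt does not describe the operators used in the paper. The fitness is passed in as a callable and is minimized.

```python
import random

N_DESCRIPTORS = 151  # descriptor count reported in the abstract

# Hypothetical log10 search bounds for the SVR parameters (not from the paper).
PARAM_BOUNDS = {"C": (-2.0, 3.0), "gamma": (-4.0, 1.0), "epsilon": (-3.0, 0.0)}


def random_chromosome(rng, n_desc=N_DESCRIPTORS):
    """A chromosome is (descriptor mask, parameter genes in [0, 1])."""
    mask = [rng.random() < 0.5 for _ in range(n_desc)]
    genes = [rng.random() for _ in PARAM_BOUNDS]
    return mask, genes


def decode(chromosome):
    """Turn a chromosome into (selected descriptor indices, SVR parameters)."""
    mask, genes = chromosome
    selected = [i for i, keep in enumerate(mask) if keep]
    params = {
        name: 10 ** (lo + g * (hi - lo))
        for g, (name, (lo, hi)) in zip(genes, PARAM_BOUNDS.items())
    }
    return selected, params


def crossover(rng, a, b):
    """Uniform crossover applied to both parts of the chromosome."""
    mask = [x if rng.random() < 0.5 else y for x, y in zip(a[0], b[0])]
    genes = [x if rng.random() < 0.5 else y for x, y in zip(a[1], b[1])]
    return mask, genes


def mutate(rng, chromosome, rate=0.02):
    """Flip mask bits and jitter parameter genes with probability `rate`."""
    mask, genes = chromosome
    mask = [(not bit) if rng.random() < rate else bit for bit in mask]
    genes = [
        min(1.0, max(0.0, g + rng.gauss(0.0, 0.1))) if rng.random() < rate else g
        for g in genes
    ]
    return mask, genes


def run_ga(fitness, n_desc=N_DESCRIPTORS, pop_size=30, generations=40, seed=0):
    """Minimize fitness(selected_indices, params) and return the best solution.

    Scores are recomputed on every comparison to keep the sketch short; an
    expensive fitness (e.g. cross-validated SVR) should be cached instead.
    """
    rng = random.Random(seed)
    population = [random_chromosome(rng, n_desc) for _ in range(pop_size)]

    def score(chromosome):
        return fitness(*decode(chromosome))

    def pick():
        # Tournament selection among three random individuals.
        return min(rng.sample(population, 3), key=score)

    best = min(population, key=score)
    for _ in range(generations):
        children = [
            mutate(rng, crossover(rng, pick(), pick())) for _ in range(pop_size - 1)
        ]
        population = children + [best]  # elitism: the best individual survives
        best = min(population, key=score)
    return decode(best), score(best)
```

In a real QSRR setting, the fitness callable would train an SVR on the selected descriptor columns with the decoded parameters and return a validation error such as the cross-validated RMSE.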

Experimental RI data

A previous study demonstrated that there is a strong correlation between the boiling point (BP) and the RI of a molecule [26]. Therefore, 252 molecules with BP information in the Molecular Operating Environment (MOE) database were used in this work to construct and test the QSRR model [25]. We first extracted the experimental RI values of these compounds acquired on non-polar columns from the NIST08 RI library [27]. It should be noted that some compounds have

Regression models

Several regression models have been proposed for RI prediction in previous work. Among them, the MLR model is the most popular. Another important model is SVR.
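One way to see the difference between the two models is through the loss each one penalizes. The sketch below contrasts the squared loss of least-squares MLR with the epsilon-insensitive loss of the standard epsilon-SVR formulation. The excerpt does not state which SVR variant the paper uses, so the epsilon-SVR form is an assumption, and the function names are illustrative.

```python
def squared_loss(residual):
    """Loss minimized by least-squares MLR: every deviation is penalized."""
    return residual ** 2


def epsilon_insensitive_loss(residual, epsilon=0.1):
    """Loss used by standard epsilon-SVR (assumed variant).

    Deviations smaller than `epsilon` cost nothing; larger ones are
    penalized linearly.
    """
    return max(0.0, abs(residual) - epsilon)
```

The width of the insensitive tube, epsilon, is one of the parameters that has to be tuned alongside the regularization constant and the kernel parameter.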

Results and discussion

In this study, we developed a GA–SVR–PO method for the prediction of molecular retention indices in gas chromatography. The performance of our GA–SVR–PO model was compared with that of three other existing models: GA–MLR, GA–MLR–SVR, and GA–SVR. In the GA–MLR model, the altered RMSE of the validation set based on MLR was used as the GA fitness to find the optimal subset of molecular descriptors. The GA–MLR–SVR is a model that uses the optimal molecular descriptors found by the GA–MLR

Conclusions

In this study, we developed a genetic algorithm and support vector regression with parameter optimization model (GA–SVR–PO) for the prediction of molecular retention indices in gas chromatography. The performance of our proposed GA–SVR–PO model was compared with that of three other existing models: GA–MLR, GA–MLR–SVR, and GA–SVR. Our analyses show that the MLR-based models can achieve a desired performance and the SVR-based models have improved performance. The SVR-based models also

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant nos. 61271098, 61672035, 61300058, 61472282 and 61032007, the Provincial Natural Science Research Program of Higher Education Institutions of Anhui Province under grant no. KJ2012A005, and the Anhui Provincial Natural Science Foundation under grant no. 1508085MF129.

References (35)

  • S.W. Lin et al.

    Parameter determination of support vector machine and feature selection using simulated annealing approach

    Appl. Soft Comput.

    (2008)
  • W.P. Eckel et al.

    Use of boiling point-Lee retention index correlation for rapid review of gas chromatography-mass spectrometry data

    Anal. Chim. Acta

    (2003)
  • R. Todeschini et al.

    Detecting "bad" regression models: multicriteria fitness functions in regression analysis

    Anal. Chim. Acta

    (2004)
  • E. Kováts

    Gas-chromatographische charakterisierung organischer verbindungen. Teil 1: retentionsindices aliphatischer halogenide, alkohole, aldehyde und ketone

    Helv. Chim. Acta

    (1958)
  • R. Kaliszan

    Quantitative Structure-Chromatographic Retention Relationships

    (1987)
  • E. Dossin et al.

    Prediction models of retention indices for increased confidence in structural elucidation during complex matrix analysis: application to gas chromatography coupled with high-resolution mass spectrometry

    Anal. Chem.

    (2016)
  • K. Heberger et al.

    Partial least squares modeling of retention data of oxo compounds in gas chromatography

    Chromatographia

    (2000)

Jun Zhang was born in Anhui Province, China, in 1971. He received the M.S. degree in Pattern Recognition and Intelligent System in 2004 from the Institute of Intelligent Machines, Chinese Academy of Sciences. He received the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2007. Currently, he is an associate professor in the School of Electrical Engineering and Automation, Anhui University, China. His research interests focus on deep learning, ensemble learning and cheminformatics.
