An improved genetic programming technique for the classification of Raman spectra

https://doi.org/10.1016/j.knosys.2004.10.001

Abstract

The aim of this study is to evaluate the effectiveness of genetic programming relative to that of more commonly used methods for the identification of components within mixtures of materials using Raman spectroscopy. A key contribution of the genetic programming technique proposed in this research is that it explicitly aims to optimise the certainty levels associated with discovered rules, so as to minimize the chance of misclassification of future samples.

Introduction

Raman spectroscopy may be described as the measurement of the intensity and wavelength of inelastically scattered light from molecules when they are excited by a monochromatic light source. The Raman scattered light occurs at wavelengths that are shifted from the incident light by the energies of molecular vibrations. The analytical applications of Raman spectroscopy continue to grow; typical applications are in structure determination [1], multi-component qualitative analysis and quantitative analysis [2].
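The relationship between the incident and scattered wavelengths can be made concrete with a short calculation. The sketch below assumes the usual convention that a Raman shift is quoted in wavenumbers (cm−1) and that the scattering is Stokes scattering, so the scattered photon has lost the vibrational energy. The function names are illustrative only. The 488 nm excitation line and the approximate 400–3340 cm−1 range are the values given in the task description.

```python
def scattered_wavelength_nm(excitation_nm: float, shift_cm1: float) -> float:
    """Wavelength (nm) of Stokes-scattered light for a given Raman shift (cm^-1)."""
    # A wavelength in nm converts to a wavenumber in cm^-1 as 1e7 / wavelength.
    excitation_cm1 = 1e7 / excitation_nm
    # Stokes scattering: the scattered photon is lower in energy by the shift.
    scattered_cm1 = excitation_cm1 - shift_cm1
    return 1e7 / scattered_cm1


def raman_shift_cm1(excitation_nm: float, scattered_nm: float) -> float:
    """Raman shift (cm^-1) recovered from the two wavelengths (nm)."""
    return 1e7 / excitation_nm - 1e7 / scattered_nm


if __name__ == "__main__":
    # Ends of the spectral range reported in the task description.
    for shift in (400.0, 3340.0):
        wavelength = scattered_wavelength_nm(488.0, shift)
        print(f"shift {shift:.0f} cm^-1 -> scattered light near {wavelength:.1f} nm")
```

Under these assumptions, the recorded shifts correspond to scattered light at wavelengths longer than the 488 nm source, from just under 500 nm to a little over 580 nm.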

Traditionally, multivariate data analysis techniques such as partial least squares (PLS) and principal component regression (PCR) have been used to identify the presence of specific compounds in mixtures from their Raman spectra [2]. However, Raman spectral elucidation suffers from several problems. The presence of fluorescent compounds, impurities, complex mixtures and other environmental and instrumental factors can greatly add to the difficulty in identifying compounds from their spectra [3]. Increasingly, machine learning techniques are being investigated as a possible solution to these problems, as they have been shown to be successful in conjunction with other spectroscopic techniques, such as the use of neural networks to identify bacteria from their infra-red spectra [4] and the application of neural networks to quantification of Fourier transform infra-red (FTIR) spectroscopy data [5]. Schultz et al. [6] used a neural network and PLS to identify individual components in biological mixtures from their Raman spectra, and Benjathapanum et al. [7] used PCR and neural networks to classify ultraviolet–visible spectroscopic data.

In this paper, neural networks, PLS and PCR are compared with the evolutionary technique of genetic programming for predicting which of four solvents are present in a range of mixtures. Genetic programming offers an advantage over neural networks and chemometric methods in this area as the rules generated are interpretable and may be used in isolation or in conjunction with expert opinion to classify spectra.
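To illustrate what an interpretable rule might look like, the following minimal sketch represents a rule as a small expression tree over spectral intensities and prints it as a readable equation. It is not the authors' implementation. The tree encoding, the feature names `I_1030` and `I_800`, the intensity values and the zero decision threshold are all hypothetical and chosen purely for illustration.

```python
import operator

# Binary operators allowed at the internal nodes of a tree (an assumed, minimal set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(node, spectrum):
    """Recursively evaluate an expression tree against one spectrum.

    A node is a tuple (op, left, right), a string naming an intensity
    in `spectrum`, or a numeric constant.
    """
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, spectrum), evaluate(right, spectrum))
    if isinstance(node, str):
        return spectrum[node]
    return float(node)


def to_text(node):
    """Render the tree as an infix equation that a spectroscopist can read."""
    if isinstance(node, tuple):
        op, left, right = node
        return f"({to_text(left)} {op} {to_text(right)})"
    return str(node)


def classify(rule, spectrum, threshold=0.0):
    """Predict 'present' when the rule output exceeds the threshold."""
    return evaluate(rule, spectrum) > threshold


if __name__ == "__main__":
    # Hypothetical rule and made-up intensities, not taken from the paper.
    rule = ("-", "I_1030", ("*", 2.0, "I_800"))
    spectrum = {"I_1030": 5.0, "I_800": 1.0}
    print(to_text(rule), "->", classify(rule, spectrum))
```

Because the rule is an explicit equation rather than a set of network weights, an expert can check which spectral regions it relies on before trusting its output.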

In addition to the environmental and instrumental problems outlined above, a significant challenge, which also arises in other machine learning problems, is the high sample dimensionality and low sample number commonly found in this area. In many real laboratory applications, materials must be identified from a small number of reference spectra. While commercial spectral databases typically contain spectra for some thousands of materials, they are organised into categories, and for individual groups of materials such as the solvents considered here, spectra would be provided for only a small number of mixtures, if any. As a consequence, machine learning models tend to overfit the training data and generalise poorly.

In response to this, rather than aiming simply to evolve equations that classify the training data correctly, our approach aims to optimise selection of equations so as to minimize the chance of misclassification of future predicted samples and thereby minimize the problems associated with low sample numbers.
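The idea of preferring equations that classify with greater certainty can be sketched as a fitness score that rewards a wide margin as well as correct answers. The fitness function actually used is not given in this excerpt, so the formula below is an assumption made for illustration only: training accuracy plus the worst-case signed distance from the decision threshold, normalised by the largest distance. The function name and the example outputs are hypothetical.

```python
def margin_fitness(outputs, labels, threshold=0.0):
    """Score a candidate equation by accuracy plus its worst-case margin.

    outputs: numeric outputs of the equation, one per training spectrum
    labels:  True where the solvent is present, False where it is absent
    """
    if not outputs or len(outputs) != len(labels):
        raise ValueError("outputs and labels must be non-empty and equal in length")
    # Signed margin: positive when the sample falls on the correct side.
    signed = [(o - threshold) if y else (threshold - o)
              for o, y in zip(outputs, labels)]
    accuracy = sum(s > 0 for s in signed) / len(signed)
    # Normalise so that rescaling an equation's output cannot inflate the score.
    scale = max(abs(s) for s in signed) or 1.0
    worst_margin = min(signed) / scale  # lies in [-1, 1]
    return accuracy + worst_margin


if __name__ == "__main__":
    labels = [True, False, True, False]
    barely_right = [0.1, -0.1, 5.0, -5.0]  # correct, but two samples sit near the threshold
    clearly_right = [4.0, -4.0, 5.0, -5.0]  # correct, with a wide gap between classes
    print(margin_fitness(barely_right, labels))
    print(margin_fitness(clearly_right, labels))
```

Both hypothetical equations classify every training sample correctly, but the second scores higher because no sample lies close to the threshold. With few training spectra, a preference of this kind is one plausible way to favour equations that are less likely to misclassify new samples.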

Few research groups have published applications of genetic programming to the interpretation of spectra. Goodacre [8] discusses the application of genetic programming to FTIR spectroscopy image analysis. Using the same genetic programming software, Ellis et al. [9] have quantified the spoilage of meat from its FTIR spectra and Taylor et al. [10] have classified Eubacterium species based on their pyrolysis mass spectra.

Section snippets

Description of task

Raman spectra were recorded on a Labram Infinity (J-Y Horiba) equipped with a liquid nitrogen cooled CCD detector and a 488 nm excitation source. All spectra were recorded at a set interval of ∼400–3340 cm−1 with a resolution of ∼11 cm−1. The liquid samples were held in 1 cm pathlength quartz cuvettes and mounted in a macro sample holder (J-Y Horiba). The macro lens has a focal length of 40 mm, which focuses through the cuvette to the centre of the liquid. The spectral data was not corrected for

Overview

This section outlines the use of standard chemometric techniques and neural networks to identify components in mixtures from their Raman spectra. It then goes on to describe an alternative technique based on genetic programming.

As mentioned in Section 1, chemometric techniques are widely used for analysing spectra. While there are many such techniques, the two chosen in this study are PCR and PLS, as they are particularly well established for the classification of spectroscopic data [11], [12],

Comparison of techniques

The PLS, PCR, neural network and genetic programming techniques that were discussed in Section 3 have been applied to the task of predicting the presence/absence of each solvent (described in Section 2). For comparison purposes, the authors have also included results using three other popular general machine learning techniques, Naïve Bayes, Ripper and C4.5, as implemented in WEKA [18], using the default settings. For all algorithms, the same sub-divisions of the data were used for training,

Conclusions and future work

This paper has described the value of genetic programming for Raman spectral classification and has introduced an improved fitness function to reduce the risk of misclassification of future samples. Genetic programming identified all solvent samples correctly with little configuration and the equations generated provide an insight into how decisions are made which offers an advantage over other techniques such as PLS, PCR and neural networks. This is very important in ‘real world’ practical

Acknowledgements

This work was supported by funding from Enterprise Ireland's Commercialisation Fund Technology Development Programme (TD/03/212) and by the National Centre for Biomedical Engineering Science as part of the Higher Education Authority Programme for Research in Third Level Institutions.

Cited by (19)

  • Model-driven regularization approach to straight line program genetic programming

    2016, Expert Systems with Applications
    Citation Excerpt :

    Main subjects in unsupervised learning like clustering have been approached using GP (see Bezdek, Boggavarapu, Hall, & Bensaid, 1994; Falco, Tarantino, Cioppa, & Fontanella, 2004; Folino, Pizzuti, & Spezzano, 2008; Jie, Xinbo, & Li-cheng, 2003). Supervised classification by evolving selection rules is another avenue in which GP obtains a remarkable success as shown, for example, in Carreño, Leguizamón, and Wagner (2007), Cano, Herrera, and Lozano (2007), Chien, Yang, and Lin (2003), Freitas (1997), Hennessy, Madden, Conroy, and Ryder (2005) and Kuo, Hong, and Chen (2007). Singular applications to medicine and biology problems (Aslam, Zhu, & Nandi, 2013; Bojarczuk, Lopes, & Freitas, 2000; Bojarczuk, Lopes, Freitas, & Michalkiewicz, 2004; Castelli, Vanneschi, & Silva, 2014), feature extraction methods (Krawiec, 2002; Smith & Bull, 2005), database clustering and rule extraction (Wedashwara, Mabu, Obayashi, & Kuremoto, 2015), generation of hybrid multi-level predictors for function approximation and regression analysis (Tsakonas & Gabrys, 2012) are other examples in which GP is applied.

  • Genetic programming based quantitative structure-retention relationships for the prediction of Kovats retention indices

    2015, Journal of Chromatography A
    Citation Excerpt :

    Despite its several attractive properties and significant potential, the GP-based SR has not been utilized as frequently as ANN and support vector regression (SVR) formalisms in the various science, engineering and technology branches. Some of the applications of the GP in chemical sciences and engineering, are soft-sensor development for biochemical systems [55], fermentation modeling [56], electronic nose [57], synthesis of heat-integrated complex distillation systems [58], classification of Raman spectra [59], optimization of a controlled release pharmaceutical formulation [60], modeling of a nanofiltration process [61], prediction of higher heating values of biomasses [62] and multiple alignment of liquid chromatography–mass spectrometry data [63]. An exhaustive literature search (also see Table 1) has revealed that the GP formalism has not been used in the development of QSRRs; it has also been rarely employed in the chromatography science.

  • Soft-sensor development for biochemical systems using genetic programming

    2014, Biochemical Engineering Journal
    Citation Excerpt :

    In a noteworthy study, Schmidt and Lipson [17] have demonstrated that the symbolic regression can be employed to search the “natural law” underlying a physical phenomenon (pendulum dynamics). Other applications of the GP include bioprocess monitoring [18], fermentation modeling [19], electronic nose [20], synthesis of heat-integrated complex distillation systems [21], classification of Raman spectra [22], and optimization of a controlled release pharmaceutical formulation [23]. The basic MLP structure portrayed in Fig. 6 is composed of three layers, namely input, hidden and output layers consisting of N, M and L processing elements (also termed “nodes” or “neurons”), respectively (where L = 1).

  • Multiple imputation and genetic programming for classification with incomplete data

    2017, GECCO 2017 - Proceedings of the 2017 Genetic and Evolutionary Computation Conference