Genome-based identification of diagnostic molecular markers for human lung carcinomas by PLS-DA

https://doi.org/10.1016/j.compbiolchem.2005.04.005

Abstract

Partial least squares discriminant analysis (PLS-DA) provides a sound statistical basis for selecting a limited number of gene transcripts that are most effective in discriminating different lung tumoral histotypes. The potential of the PLS-DA approach is demonstrated by its ability to identify genes which, according to current knowledge, are considered molecular markers for colon cancer diagnostics and classification. Indeed, application of PLS-DA to in vivo data allowed the identification of a set of genes able to discriminate primary lung tumours from colon metastases.

Introduction

The advent of cDNA microarray technology, which produces large amounts of data, highlighted the potential of multivariate approaches for handling very large databases.

Our group has recently applied multivariate methods to the National Cancer Institute (NCI) database providing 9703 cDNAs or gene transcripts representing ∼8000 unique genes expressed in 60 human tumour cell lines, including leukaemia, non-small-cell lung, central nervous system, colon, melanoma, ovarian, breast, prostate and renal tumour cells, as well as in vitro data on the activity of several anticancer drugs with respect to the same cell lines (Database resources currently available on the World Wide Web: http://dtp.nci.nih.gov/webdata.html).

In particular, the first multivariate insight into the above database estimated the effect of biochemical cell line properties, such as the molecular targets (Musumarra et al., 2001a), as well as the influence of gene transcripts on the efficacy of drugs which act by the same mechanism and for novel active compounds (Musumarra et al., 2001b). Moreover, the identification of relevant uncharacterized gene expression targets, deserving high priority in future molecular studies (Musumarra et al., 2001b) suggested shortcuts in genome-based cancer pharmacology research.

In this context, partial least squares (PLS) appears to be a suitable statistical tool for handling genetic databases with the aim of developing new diagnostic tests. Our group (Musumarra et al., 2003) used PLS-discriminant analysis (PLS-DA) (Wold et al., 1998) to select, from an original data set of 9605 variables, the few gene transcripts most effective in discriminating different tumoral histotypes, namely melanoma, colon, leukaemia, renal, and CNS tumour cells. For melanoma, in position 10 out of 9605, we found protein S-100, a prognostic parameter in patients with metastatic melanoma and a marker for melanoma metastasis (Orchard, 2000; Schlagenhauff et al., 2000), while MUC 13 and S100P proteins were suggested (Musumarra et al., 2004) as candidates for the development of new colon cancer diagnostics.

All the above results, however, were obtained by applying multivariate approaches to databases reporting in vitro data. The difficulties in extending in vitro results to in vivo responses, involving complex biochemical processes, are well known. In order to overcome the above problems, we here report a multivariate insight into a pathological data set including mRNA expression profiling (12,600 transcript sequences) for 17 normal lung specimens and 186 lung tumour samples (Bhattacharjee et al., 2001). Among these, 125 adenocarcinoma samples were associated with clinical data and with histological slides from adjacent sections. Application of hierarchical probabilistic clustering allowed the definition of distinct subclasses of lung adenocarcinoma and therefore a new molecular taxonomy of such tumours, demonstrating the potential power of gene expression profiling in lung cancer diagnosis (Bhattacharjee et al., 2001). The above approach, however, performs qualitative classification into sub-groups, preventing clustering of samples assigned to different main groups.

In this work we apply multivariate methods that provide quantitative parameters, namely principal components analysis (PCA) (Wold, 1987) and PLS-DA, with the aim of extracting further information from the above database, which is accompanied by valuable histological and clinical data. In particular, PCA will be exploited for pattern recognition of the overall data set and for soft independent modelling of class analogy (SIMCA) classification based on mRNA expression profiling data. Once SIMCA classification confirms the assignment of samples to histologically determined classes by means of a quantitative statistical parameter (DModX, see Section 2), the importance of each variable (gene) in discriminating membership of a given class (i.e. different lung tumours) can be evaluated by PLS-DA (Musumarra et al., 2003), establishing cause–effect relationships between a set of descriptor variables (the X gene matrix) and an appropriately selected dependent variable matrix (the Y dummy matrix). These results may be relevant for identifying new lung cancer molecular markers with diagnostic value.
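The relationship between the X gene matrix and the Y dummy matrix can be illustrated with a minimal sketch. It is not the authors' implementation (the study used the SIMCA-P package) and it does not reproduce a full PLS-DA model or its variable importance statistics. It only builds a 0/1 dummy matrix, autoscales X (assumed here to mean centring each variable and scaling it to unit variance), and computes the first PLS weight vector, which is proportional to X^T y, as a rough ranking of genes for one class. The class labels "AD" and "SQ", the toy expression values, and all function names are hypothetical.

```python
import math

def dummy_matrix(labels):
    """Return (classes, Y): Y[i][k] is 1.0 if sample i belongs to class k, else 0.0."""
    classes = sorted(set(labels))
    Y = [[1.0 if lab == c else 0.0 for c in classes] for lab in labels]
    return classes, Y

def autoscale(X):
    """Centre each column and scale it to unit variance (sample std, n - 1)."""
    n, p = len(X), len(X[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        sd = math.sqrt(var) or 1.0  # a constant column stays at zero after centring
        for i in range(n):
            out[i][j] = (X[i][j] - mean) / sd
    return out

def first_pls_weights(X, y):
    """First PLS weight vector, proportional to X^T y (y centred), unit length."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    w = [sum(X[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    return [v / norm for v in w]

# Hypothetical toy data: 6 samples x 3 genes, two invented classes.
labels = ["AD", "AD", "AD", "SQ", "SQ", "SQ"]
X = [
    [9.0, 5.0, 2.0],
    [8.5, 4.0, 2.5],
    [9.5, 6.0, 1.5],
    [2.0, 5.5, 3.0],
    [1.5, 4.5, 2.8],
    [2.5, 5.0, 3.5],
]

classes, Y = dummy_matrix(labels)
y_first_class = [row[0] for row in Y]          # dummy column for classes[0]
w = first_pls_weights(autoscale(X), y_first_class)

# Rank gene indices by absolute weight: larger |w| means stronger discrimination.
ranking = sorted(range(len(w)), key=lambda j: -abs(w[j]))
print(classes, ranking, [round(v, 3) for v in w])
```

In this toy case the gene with the largest absolute weight is the one whose expression differs most between the two invented classes, while the sign of each weight indicates the class in which the gene is more highly expressed.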

Methods

The data set was taken from Bhattacharjee et al. (2001) (published on the PNAS web site www.pnas.org and at www.genome.wi.mit.edu/MPR/lung), in which 203 snap-frozen lung specimens were characterized by multivariate biological “fingerprints” given by 12,600 transcript sequences.

PCA (Wold, 1987) on the above data matrix was carried out using the SIMCA software package (SIMCA-P 8.0 by Umetrics), adopting variable autoscaling, which consists in multiplying the variables by appropriate weights (the

Results and discussion

The database can be represented as a matrix in which the 203 lung specimens (i.e. objects) are characterized by a multivariate biological fingerprint given by the gene expression profiles (the 12,600 “descriptor” variables). PCA carried out on the entire data set, including 203 objects and 12,530 variables (70 variables were excluded from the original database because they had more than 50% missing values), provided a 5-principal component (PC) model (see Table 1) explaining

Conclusion

In conclusion, the validity of PCA and PLS-DA for selection of discriminating genes for different lung tumour subclasses is confirmed by identification of several markers already in clinical use. The above finding is particularly relevant for genome-based differentiation of colon metastases from primary lung tumours. Extension of the approach adopted in the present work to mRNA expression profiling of other metastatic tumours might be extremely useful in designing new tools for the resolution

Acknowledgement

Financial support of the University of Catania is gratefully acknowledged.

Cited by (24)

  • 4.21 - Data Processing for RNA/DNA Sequencing

    2020, Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Second Edition: Four Volume Set
  • Comparison of two immersion probes coupled with visible/near infrared spectroscopy to assess the must infection at the grape receiving area

    2018, Computers and Electronics in Agriculture
    Citation Excerpt :

    A matrix of artificial (dummy) variables, assuming a discrete numerical value (zero or one), was used as Y data. In case of two classes, the Y dummy matrix was constructed so that the value of the objects belonging to the first class was zero (e.g. class IL0), and the value of the other class was one (e.g. class IL1) (Musumarra et al., 2005; Liu et al., 2008). In this context, PLS-DA was carried out to assess the possibility to distinguish the IL0 from the IL1.

  • Rapid evaluation of grape phytosanitary status directly at the check point station entering the winery by using visible/near infrared spectroscopy

    2017, Journal of Food Engineering
    Citation Excerpt :

    A matrix of artificial (dummy) variables, assuming a discrete numerical value (zero or one), was used as Y data. The Y dummy matrix was constructed so that the value of the objects belonging to the class was one, and the value of all other objects was zero (Liu et al., 2008; Musumarra et al., 2005). In this context, PLS-DA was carried out to assess the possibility to distinguish the IL0 from the IL1.

  • Metabolomics

    2017, Nutrition in the Prevention and Treatment of Disease
  • Influence of packaging in the analysis of fresh-cut Valerianella locusta L. and Golden Delicious apple slices by visible-near infrared and near infrared spectroscopy

    2016, Journal of Food Engineering
    Citation Excerpt :

    A matrix of artificial (dummy) variables, assuming a discrete numerical value (zero or one), was used as Y data. The Y dummy matrix was constructed so that the value of the objects belonging to the class was one, and the value of all other objects was zero (Musumarra et al., 2005; Liu et al., 2008). In this context, PLS-DA was carried out to assess the possibility to distinguish the samples before the expiration date (BED, class zero) from the samples after the expiration date (AED, class one), for both fresh-cut leaves and apple slices.

  • Monitoring of fresh-cut Valerianella locusta Laterr. shelf life by electronic nose and VIS-NIR spectroscopy

    2014, Talanta
    Citation Excerpt :

    A matrix of artificial (dummy) variables, assuming a discrete numerical value (zero or one), was used as Y data. The Y dummy matrix was constructed so that the value of the objects belonging to the class was one, and the value of all other objects was zero [27,28]. In this context, PLS-DA was carried out to assess the evolution of the fresh-cut Valerianella during storage.
