Research Article
Meta-analysis of gene expression for development and validation of a diagnostic biomarker panel for Oral Squamous Cell Carcinoma

https://doi.org/10.1016/j.compbiolchem.2019.06.008

Abstract

We use a newly developed feature extraction and classification method to analyze previously published gene expression data sets from Oral Squamous Cell Carcinoma and healthy oral mucosa in order to find a gene set sufficient for diagnosis. The feature selection technology is based on the relative dichotomy power concept that we published earlier. The resulting biomarker panel has 100% sensitivity and 95% specificity, is enriched in genes associated with oncogenesis and invasive tumor growth, and, unlike marker panels devised in earlier studies, shows concordance with previously published marker genes.

Highlights

  • Present a diagnostic biomarker panel for Oral Squamous Cell Carcinoma.

  • Use meta-analysis of gene expression data and sort features with a dichotomy power metric.

  • Results are reproducible - concordant with earlier studies.

  • Explain irreproducibility of molecular biomarkers for this cancer seen in earlier studies.

Introduction

Oral Squamous Cell Carcinoma (OSCC) is the most common form of oral cancer (96% of cases) and a source of significant morbidity and mortality, which makes it a major public health issue (Siegel et al., 2012; Neville and Day, 2002). Because the tumor grows aggressively, early diagnosis is critical for successful treatment (Van Hooff et al., 2012; Saintigny et al., 2011; Peng et al., 2011; Garnis et al., 2009; Chen, 2008; Liu et al., 2010; Silverman, 1988). Combined with modern progress in genomics, this need has resulted in an explosive development of genetic biomarker panels for detection of OSCC (Garnis et al., 2009; Chen, 2008; Exarchos et al., 2012; Ye and Yu, 2008; Gleber-Netto et al., 2016; Menke et al., 2017; Dahiya and Dhankhar, 2016), prediction of disease progression (Van Hooff et al., 2012; Saintigny et al., 2011; Exarchos et al., 2012; Dahiya and Dhankhar, 2016; Watanabe et al., 2008; Roepman et al., 2006a, 2005; Rickman et al., 2008; Nagata et al., 2005; Zhou et al., 2006; Exarchos et al., 2011; Méndez et al., 2011; Tomkiewicz et al., 2012; Nguyen et al., 2007; Yun et al., 2018; Wintergerst, 2018; Chung et al., 2018; Enokida et al., 2017; Gao et al., 2018; Chauhan et al., 2015; Onken et al., 2014; Mes et al., 2017; Shen et al., 2017; De Cecco et al., 2015; Parris et al., 2014; Chen et al., 2015), and prediction of an individual’s response to a particular treatment, thus enabling personalized approaches to therapy (Watanabe et al., 2008; Lohavanichbutr et al., 2013; Reis et al., 2011; Bhattacharya et al., 2011; MacConaill et al., 2009; Severino et al., 2008; Floyd and McShane, 2004; Sidransky, 2002). (In our references, we include only gene-expression-based biomarker panels.) Attempts have also been made to differentiate OSCC subtypes based on gene expression profiles (Chen et al., 2015; Bavle et al., 2016) and to identify genes associated with staging of the disease (Randhawa and Acharya, 2015).

There is no single established framework for discovery of biomarkers from high-throughput data (Cochran, 1977; Saeys, 2007). Although the number of proposed algorithms is quite large (for reviews, see Exarchos et al., 2012, 2011; Duval and Hao, 2009; Jiang et al., 2007; Goeman and Buhlmann, 2007; Shi et al., 2010), their broad applicability and relative advantages are debatable. In 2010 the MAQC-II consortium (Klein et al., 2009) attempted to systematically evaluate various marker derivation techniques. They concluded that while it is possible to derive robust markers for various conditions (“targets” in MAQC terminology, which we also use here), there is no single statistical algorithm for doing so, and the most important factor contributing to the quality of the markers is the experience of the team that developed them. In general, discovery of clinically validated cancer biomarkers is currently a cumbersome process (Exarchos et al., 2011; Sawyers, 2008). Sometimes the features chosen for the biomarker panel undergo further selection to satisfy the key biomarker criteria (Strimbu and Tavel, 2010). The earlier studies cited in this paper used a wide variety of methods: z-score cut-offs to pick outliers followed by logistic regression (Chen, 2008); Principal Component Analysis followed by a “mixed-effect” linear model to identify differentially expressed genes with a fold change above the conventional 2-fold cutoff (Ye and Yu, 2008); Cox regression (Lohavanichbutr et al., 2013); and support vector machines (Ziober et al., 2008) in combination with the Golub method of “weighted vote” (Golub et al., 1999).

One outstanding feature of all genetic biomarkers developed for OSCC is the lack of consistency between the sets of features selected as markers by different researchers, which results in a lack of cross-study reproducibility. This phenomenon is not specific to OSCC; it is commonly observed for many conditions and is known as molecular signature multiplicity. It has been well illustrated in the literature and analyzed from a mathematical standpoint (Statnikov and Aliferis, 2010). The most striking examples of molecular signature multiplicity are those where the same target condition is recognized equally well by different marker panels that share no genes, even when such discordant signatures were developed by the same team (Statnikov and Aliferis, 2010). Molecular signature multiplicity has a trivial combinatorial explanation: the features selected to distinguish the targets in one experiment do not have to be the same as those selected in another experiment, and neither set has to correspond to the features that have a causal relationship with the target condition. In addition, in cancer gene expression and sequencing studies there are two significant sources of noise that complicate the analysis. One is that oncogenesis results from multiple discordant mutations, as is currently accepted (Bhattacharya et al., 2011; Severino et al., 2008; Roepman et al., 2006b). The other is sample heterogeneity, that is, mixed tumor and non-tumor (stroma) cells within a sample (Klein et al., 2009). As a result, the variation of any single gene may not be consistent across multiple samples. This worsens the performance of conventional feature extraction algorithms (see Severino et al., 2008; Roepman et al., 2006b, and references therein).
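The combinatorial explanation can be illustrated with a small simulation. The sketch below is not taken from the cited studies: the cohort sizes, the number of informative genes, the effect size, and the ranking rule (absolute difference of class means) are all assumptions chosen for illustration. It draws two independent cohorts from the same population, in which many genes are equally informative, and selects a top-k signature from each.

```python
import random
from statistics import mean


def simulate_cohort(rng, n_genes, informative, n_per_class, shift=1.0):
    """Return (tumor, normal): lists of per-sample expression vectors.

    Genes in `informative` are shifted by `shift` in tumor samples;
    all other values are standard normal noise (hypothetical model).
    """
    def sample(is_tumor):
        return [
            rng.gauss(shift if (is_tumor and g in informative) else 0.0, 1.0)
            for g in range(n_genes)
        ]

    tumor = [sample(True) for _ in range(n_per_class)]
    normal = [sample(False) for _ in range(n_per_class)]
    return tumor, normal


def top_k_genes(tumor, normal, k):
    """Rank genes by absolute difference of class means and keep the top k."""
    scores = []
    for g in range(len(tumor[0])):
        diff = mean(s[g] for s in tumor) - mean(s[g] for s in normal)
        scores.append((abs(diff), g))
    scores.sort(reverse=True)
    return {g for _, g in scores[:k]}


def signature_overlap(seed=0, n_genes=200, n_informative=40, n_per_class=15, k=10):
    """Select a k-gene signature in two independent cohorts and count shared genes."""
    rng = random.Random(seed)
    informative = set(range(n_informative))
    sig1 = top_k_genes(*simulate_cohort(rng, n_genes, informative, n_per_class), k)
    sig2 = top_k_genes(*simulate_cohort(rng, n_genes, informative, n_per_class), k)
    return sig1, sig2, len(sig1 & sig2)


if __name__ == "__main__":
    sig1, sig2, shared = signature_overlap()
    print(sorted(sig1), sorted(sig2), shared)
```

Because more genes carry signal than fit into one signature, sampling noise decides which of them rank highest in each cohort, so the two signatures need not coincide even though both are drawn from the same underlying biology.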

This paper is focused on the diagnostic markers of OSCC. We have collected gene signatures and performance figures from recent literature in Table 1.

Many other genes have been associated with OSCC progression (Table 2); however, the respective authors did not use them for diagnosing OSCC.

We note that the vast majority of authors aim to limit the size of the biomarker signature to just a few probes, with the exception of Ye and Yu (2008), where multiple over- and under-expressed genes were used. Apart from metalloproteinases MMP-1 through MMP-13, there is no concordance between the diagnostic gene signatures.

This paper aims to apply a feature extraction technique that we developed earlier (Makarov and Gorlin, 2018) to OSCC diagnosis. Specifically, we would like to identify those genes for which changes in expression yield the best differentiation between OSCC and healthy tissue samples. These are not necessarily the genes that exhibit the greatest differential expression between the two states, nor the genes that are involved in the process of oncogenesis.
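The distinction between differential expression and discriminating ability can be shown with a toy example. The sketch below does not implement relative dichotomy power, which is defined in Makarov and Gorlin (2018); as a stand-in it uses the best accuracy achievable with a single cut-off on one gene. The expression values are hypothetical log2 intensities invented for illustration.

```python
from statistics import mean


def log2_fold_change(tumor, normal):
    """Difference of mean log2 expression (tumor minus normal)."""
    return mean(tumor) - mean(normal)


def best_threshold_accuracy(tumor, normal):
    """Best in-sample accuracy of a single cut-off on one gene.

    Every observed value is tried as a threshold, in both directions
    (tumor above the cut-off, or tumor below it).
    """
    n = len(tumor) + len(normal)
    best = 0.0
    for t in sorted(set(tumor) | set(normal)):
        # Rule "at or above t means tumor"; the reverse rule gets n - correct.
        correct = sum(x >= t for x in tumor) + sum(x < t for x in normal)
        best = max(best, correct / n, (n - correct) / n)
    return best


# Hypothetical data: gene A is strongly up-regulated in only half of the tumors.
gene_a_tumor = [12.0, 5.0, 11.5, 5.5, 12.5, 5.2]
gene_a_normal = [5.1, 5.4, 5.0, 5.6, 5.3, 5.2]
# Hypothetical data: gene B shifts modestly but consistently in every tumor.
gene_b_tumor = [7.9, 8.0, 8.1, 7.8, 8.2, 8.0]
gene_b_normal = [7.1, 7.0, 7.2, 6.9, 7.3, 7.0]

if __name__ == "__main__":
    for name, t, n in [("A", gene_a_tumor, gene_a_normal),
                       ("B", gene_b_tumor, gene_b_normal)]:
        print(name, round(log2_fold_change(t, n), 2), best_threshold_accuracy(t, n))
```

In this invented data, gene A has the larger mean fold change but separates the classes poorly, because only some tumors express it strongly, whereas gene B changes less on average yet separates the two classes completely.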

Methods

Our computational biomarker discovery technology is described in detail in (Makarov and Gorlin, 2018). It is a filter feature selection and classification method (refer to (Liu and Motoda, 1998) for an overall review of various computational classification methods). We outline it here in brief. Our method uses large multivariate data sets that typically result from experiments that monitor changes in RNA, small molecule, or protein abundance. The principal difficulty in building of classifiers

Results and discussion

Using the data sets mentioned above, we produced a marker panel that in tests shows 100% sensitivity and 95% specificity (98% overall accuracy; Matthews correlation coefficient (Matthews, 1975; Boughorbel et al., 2017) of 95.5%). The list of genes included in the panel is provided in Table 3.
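For readers less familiar with these figures of merit, the sketch below shows how sensitivity, specificity, accuracy and the Matthews correlation coefficient follow from the four confusion-matrix counts. The counts used are hypothetical: they are not the study's test set, and were chosen only so that sensitivity and specificity come out at 100% and 95%. The resulting coefficient therefore need not equal the reported value.

```python
from math import sqrt


def diagnostic_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: diseased samples called positive/negative;
    tn/fp: healthy samples called negative/positive.
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts (not from the paper): 30 tumor and 20 healthy samples.
example = diagnostic_metrics(tp=30, fn=0, tn=19, fp=1)

if __name__ == "__main__":
    for name, value in example.items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four counts symmetrically, which makes it less sensitive to unequal class sizes.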

The genes are listed in the order of relative dichotomy power (RDP) and separately for the OSCC and normal sample recognition targets. For completeness, we included all genes that were assigned relative dichotomy

Declaration of Competing Interest

The authors hereby disclaim any conflicts of interest that may inappropriately influence their work.

Acknowledgement

This work was supported by the US Army Medical Research and Materiel Command under Contract No. W81XWH-11-C-0036. The views, opinions and findings contained in this report are those of the authors and should not be construed as an official Department of the Army position, policy or decision unless so designated by other documentation.

References (74)

  • S. Boughorbel, Optimal classifier for imbalanced data using Matthews correlation coefficient metric, PLoS One (2017)
  • S.S. Chauhan, Prediction of recurrence-free survival using a protein expression-based risk classifier for head and neck cancer, Oncogenesis (2015)
  • C. Chen, Gene expression profiling identifies genes predictive of oral squamous cell carcinoma, Cancer Epidemiol. Biomarkers Prev. (2008)
  • S.J. Chen, Ultra-deep targeted sequencing of advanced oral squamous cell carcinoma identifies a mutation-based prognostic gene signature, Oncotarget (2015)
  • J.H. Chung, SOX2 activation predicts prognosis in patients with head and neck squamous cell carcinoma, Sci. Rep. (2018)
  • K. Dahiya et al., Updated overview of current biomarkers in head and neck carcinoma, World J. Methodol. (2016)
  • L. De Cecco, Head and neck cancer subtypes with biological and clinical relevance: meta-analysis of gene-expression data, Oncotarget (2015)
  • B. Duval et al., Advances in metaheuristics for gene selection and classification of microarray data, Brief Bioinform. (2009)
  • T. Enokida, Gene expression profiling to predict recurrence of advanced squamous cell carcinoma of the tongue: discovery and external validation, Oncotarget (2017)
  • K. Exarchos, Gene expression profiling towards the prediction of oral cancer reoccurrence, 33rd Annual International Conference of the IEEE EMBS (2011)
  • K.P. Exarchos, A multiscale and multiparametric approach for modeling the progression of oral cancer, BMC Med. Inform. Decis. Mak. (2012)
  • E. Floyd et al., Development and use of biomarkers in oncology drug development, Toxicol. Pathol. (2004)
  • J. Gao, Twenty-four signature genes predict the prognosis of oral squamous cell carcinoma with high accuracy and repeatability, Mol. Med. Rep. (2018)
  • C. Garnis, Genomic imbalances in precancerous tissues signal oral cancer risk, Mol. Cancer (2009)
  • F.O. Gleber-Netto, Salivary biomarkers for detection of oral squamous cell carcinoma in a Taiwanese population, Clin. Cancer Res. (2016)
  • J.J. Goeman et al., Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics (2007)
  • T.R. Golub, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
  • Y. He, Large-scale transcriptomics analysis suggests over-expression of BGH3, MMP9 and PDIA3 in oral squamous cell carcinoma, PLoS One (2016)
  • W. Jiang, Biomarker-adaptive threshold design: a procedure for evaluating treatment with possible biomarker-defined subset effect, J. Natl. Cancer Inst. (2007)
  • H.-U. Klein, Quantitative comparison of microarray experiments with published leukemia related gene expression signatures, BMC Bioinform. (2009)
  • C. Kuropkat, Tumor marker potential of serum matrix metalloproteinases in patients with head and neck cancer, Anticancer Res. (2002)
  • H. Liu et al., Feature Selection for Knowledge Discovery and Data Mining (1998)
  • X. Liu, Gene expression signatures of lymph node metastasis in oral cancer: molecular characteristics and clinical significances, Curr. Cancer Ther. Rev. (2010)
  • P. Lohavanichbutr, A 13-gene signature prognostic of HPV-negative OSCC: discovery and external validation, Clin. Cancer Res. (2013)
  • L.E. MacConaill, Profiling critical cancer gene mutations in clinical tumor samples, PLoS One (2009)
  • E. Méndez, Can a metastatic gene expression profile outperform tumor size as a predictor of occult lymph node metastasis in oral cancer patients?, Clin. Cancer Res. (2011)