1 Introduction

Cancer, a label applied to a variety of diseases featuring excessive cell proliferation, is driven by changes at the genomic level, which define a distinct metabolic profile that supports the tumorigenic process. A common alteration, usually referred to as the Warburg effect [1], is the observation that cancer cells resort to glycolysis with subsequent lactate fermentation to produce energy, even under aerobic conditions. Many other metabolic changes have since been documented, and a recent review has identified six cancer metabolism hallmarks [2].

These changes in intracellular, extracellular, and circulating metabolites can be assessed using one of two approaches. Targeted studies focus on a selected subset of known metabolites, while untargeted studies attempt to profile the metabolome without a predefined list of compounds. Metabolomics data can be obtained using techniques such as Nuclear Magnetic Resonance (NMR) spectroscopy and Mass Spectrometry (MS), the latter normally coupled to Gas or Liquid Chromatography (GC-MS or LC-MS).

Table 1. Data analysis methods used in a selection of recent cancer NMR and MS studies.

NMR has been used extensively in cancer studies for several purposes, such as distinguishing between tumor and normal samples [3], predicting patient survival [4] and tumor recurrence [5], and monitoring tumor drug response [6]. Applications of MS in cancer research include the characterization of metabolite signatures in lung cancer patients undergoing treatment [7], and several cases of metabolic profiling to find diagnostic or prognostic biomarkers of lung, colorectal, ovarian and hepatocellular tumors [8,9,10,11,12].

Univariate and multivariate statistical methods can be applied either to NMR and MS peak data or to the metabolites identified from those data and their respective concentrations. Table 1 shows a selection of the most relevant studies in cancer metabolomics using NMR and MS. The data analysis strategies are presented in the following sections.

2 Univariate Analysis

Univariate analysis examines one data variable at a time, relating its values to those of the metadata variables. It is easy to perform and interpret, and relies on methods such as t-tests (TT), one-way and multifactor analysis of variance (ANOVA), Mann-Whitney (MW), Kruskal-Wallis (KW) and Kolmogorov-Smirnov (KS) tests, fold change (FC), regression and correlation analysis (CA). These methods yield (ranked) sets of variables that are candidates for discriminating a clinical variable. They are therefore quite useful for biomarker prediction, and as a first step before classification or regression with machine learning.
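As an illustration of how such a ranking can be produced, the following minimal Python sketch computes Welch's t statistic and the log2 fold change for each metabolite and ranks the metabolites by the absolute t value. The metabolite names and intensities are hypothetical, the function names are our own, and the code is not the specmine API. The p-value is omitted because it requires the t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def log2_fold_change(a, b):
    """log2 of the ratio between the group means (a over b)."""
    return math.log2(mean(a) / mean(b))

# Hypothetical intensities: (tumor samples, control samples) per metabolite
data = {
    "lactate": ([8.1, 7.9, 8.4, 8.0], [4.0, 4.2, 3.9, 4.1]),
    "alanine": ([2.0, 2.1, 1.9, 2.0], [2.0, 1.9, 2.1, 2.0]),
}

# Rank metabolites by the magnitude of the t statistic
ranked = sorted(data, key=lambda m: abs(welch_t(*data[m])[0]), reverse=True)
for name in ranked:
    t, df = welch_t(*data[name])
    print(name, round(t, 2), round(df, 1), round(log2_fold_change(*data[name]), 2))
```

In a real study the resulting p-values would also need correction for multiple testing, since one test is run per variable.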

In cancer metabolomics specifically, univariate analysis has been performed in many studies, as Table 1 shows. One example is the use of one-way ANOVA and Tukey's Honest Significant Difference (HSD) test to study NMR data from breast cancer cells [15]. Also, in a breast cancer chemotherapy study [33], the authors performed paired and unpaired t-tests on MS data, as well as two-way ANOVA to study the interaction of two variables.

3 Unsupervised Multivariate Analysis

This type of analysis summarizes data and thus detects patterns that can be related to biological or experimental variables.

Principal Component Analysis (PCA) is the most frequently used unsupervised learning method for data analysis, normally used in metabolomics to discover patterns in the data which may reveal how samples group based on their metabolic profiles. It is a dimensionality reduction technique, which produces new variables through linear combinations of the original variables [35], to explain as much of the variance in the original data set as possible.
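The idea of a linear combination that captures maximal variance can be sketched in a few lines of Python. The example below is a minimal illustration, assuming a small hypothetical samples-by-metabolites table; it extracts only the first principal component by power iteration on the covariance matrix, whereas production code would normally use a singular value decomposition from a numerical library. The function and variable names are our own.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores, explained variance ratio) for PC1."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the variables
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration; assumes the start vector is not orthogonal to PC1
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance

# Hypothetical data: 6 samples x 3 metabolites, two groups of 3 samples
samples = [
    [1.0, 2.0, 5.0], [1.2, 2.1, 5.1], [0.9, 1.9, 4.9],
    [3.0, 4.1, 5.0], [3.1, 3.9, 5.2], [2.9, 4.0, 4.8],
]
loadings, scores, ratio = first_principal_component(samples)
print("loadings:", [round(x, 3) for x in loadings])
print("scores:", [round(s, 2) for s in scores])
print("explained variance ratio:", round(ratio, 3))
```

In this toy table the two groups fall on opposite sides of zero along the first component, which is the kind of grouping pattern a PCA scores plot is used to reveal.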

In recent cancer studies using NMR, PCA has been applied, for instance, to discriminate between four groups of MCF7 breast cancer cell lines with or without tamoxifen resistance and/or CK-α downregulation [18], and to separate gastric cancer samples from control samples [3]. Regarding MS approaches, PCA has been combined with supervised methods to detect biomarkers related to prostate cancer [36], and has been used to assess the different metabolic profiles of ovarian cancer stem cells and cancer cells [11].

Hierarchical clustering (HC), in turn, separates observations into groups and establishes a hierarchical ordering of the data points based on a measure of dissimilarity between observations. In [15], HC was performed on metabolite concentration data derived from NMR experiments on different breast cancer cell lines to assess the effect of radiation therapy or poly ADP-ribose polymerase inhibition. In another study [13], the authors used HC to evaluate the separation between advanced colorectal cancer samples and controls, based on NMR data from fecal extracts. They did, however, conclude that PCA performed better at this task than HC. In [11], following an MS approach, HC demonstrated a clear separation between cell types, based on the intracellular profiles of ovarian cancer stem cells and ovarian cancer cells, while in another MS cancer study, HC allowed the estimation of clinical metabolic biomarkers from plasma for the diagnosis of esophageal squamous-cell carcinoma [37].
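The grouping procedure can be illustrated with a minimal agglomerative sketch in Python. It assumes Euclidean distance as the dissimilarity measure and average linkage as the merging rule; the cited studies may have used other choices, and the sample profiles below are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two metabolic profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of sample indices) and the merge
    history, which is the information a dendrogram is drawn from.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters
                d = sum(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters, merges

# Hypothetical profiles: samples 0-2 from one cell line, 3-4 from another
profiles = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [3.0, 4.0], [3.1, 3.9]]
clusters, merges = average_linkage(profiles, n_clusters=2)
print(clusters)
```

The merge distances recorded in the history correspond to the heights at which branches join in a dendrogram.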

K-means is another clustering approach, which partitions observations into a predefined number k of groups. The algorithm is initialized by taking k observations as the initial cluster centers. Each sample is then assigned to the cluster with the nearest mean, and the cluster means are recalculated after every assignment [38]. As an example of its application to MS data, the authors of [39] used it to identify metabolite signatures of malignant glioma from human cerebrospinal fluid.
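The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses the common batch variant (Lloyd's algorithm), in which the means are recalculated once per pass over all samples rather than after each individual assignment, and the profiles and the seed are hypothetical.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Batch k-means: returns (labels, centers)."""
    rng = random.Random(seed)
    # Initialize the centers with k randomly chosen observations
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each sample goes to the nearest center
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical profiles: two well-separated groups of three samples
profiles = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9],
            [3.0, 4.0], [3.1, 3.9], [2.9, 4.1]]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

Because the result depends on the initial centers, k-means is usually run several times with different seeds and the partition with the lowest within-cluster variance is kept.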

4 Supervised Multivariate Analysis

Supervised multivariate analysis creates models capable of predicting an output from a given data input, trained on data for which the output is known.

Partial least squares (PLS) regression, partial least squares discriminant analysis (PLS-DA), and orthogonal partial least squares discriminant analysis (OPLS-DA) are the most popular supervised learning methods used in metabolomics studies. PLS [40] models the relationship between a matrix of predictor variables and one or more output variables by finding a set of new variables that maximize the explained covariance. PLS-DA is an adaptation of the partial least squares algorithm for classification, and is used to analyze group separation [41].
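To make the covariance criterion concrete, the following Python sketch fits a PLS model with a single response and a single component. For this special case, the weight vector that maximizes the covariance between the scores and the response is proportional to the product of the centered data matrix (transposed) and the centered response. Coding the two classes as 0 and 1 and thresholding the prediction at 0.5 gives a simple PLS-DA-style classifier. The data, the class coding and the threshold are assumptions made for illustration; this is not the specmine implementation.

```python
import math

def pls1_one_component(X, y):
    """One-component PLS for a single response: returns (weights, scores, predict)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(r[j] for r in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[r[j] - x_mean[j] for j in range(p)] for r in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximal covariance with y (normalized X'y)
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each centered sample onto the weight vector
    t = [sum(r[j] * w[j] for j in range(p)) for r in Xc]
    # Regress the centered response on the scores
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(x):
        score = sum((x[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return w, t, predict

# Hypothetical data: 6 samples x 3 metabolites; class 0 = control, 1 = tumor
X = [
    [1.0, 2.0, 5.0], [1.2, 2.1, 5.1], [0.9, 1.9, 4.9],
    [3.0, 4.1, 5.0], [3.1, 3.9, 5.2], [2.9, 4.0, 4.8],
]
y = [0, 0, 0, 1, 1, 1]
weights, scores, predict = pls1_one_component(X, y)
new_sample = [3.0, 4.0, 5.0]
predicted_class = 1 if predict(new_sample) > 0.5 else 0
print([round(v, 3) for v in weights], predicted_class)
```

Further components would be obtained by deflating the data matrix and repeating the same steps, and a real analysis would validate the model by cross-validation or permutation tests to guard against overfitting.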

In recent metabolomics NMR and MS studies, PLS-DA has been used, for instance, to identify a urinary metabolite signature for renal cell carcinoma [19]. In [33], PLS-DA, using MS data, revealed a trend to separate premenopausal and postmenopausal samples, suggesting that altered serum levels of oleic acid in breast cancer patients are associated with their response to chemotherapy. OPLS-DA is a variant of PLS-DA in which non-correlated variation is removed to facilitate model interpretability [42]. It has been applied, for example, to discriminate between pancreatic adenocarcinoma and healthy tissue [4], and to differentiate between basal cell carcinoma and normal skin samples [21].

To build predictors in cancer metabolomics studies, random forests (RF) are another model that can be used for classification or regression. RFs are ensembles of decision trees, each made up of decision rules inferred from the input data [43]. In one study, an RF was used to determine whether NMR data could distinguish between groups of cancer patients (with cachexia, pre-cachexia or stable weight) and healthy controls [26]. This RF also served as a feature selection step: the importance of each metabolite was evaluated and the fifteen most predictive metabolites were selected. In another study using both NMR spectroscopy and MS [44], experimental data were used to train an RF that could distinguish between hepatocellular carcinoma, liver cirrhosis and control serum samples. The RF was valuable in selecting the most important metabolites that could accurately discriminate the groups and could be considered potential biomarkers. In [9], RF models were trained on MS data from a set of lung cancer and control cases. The models revealed that three well-known nicotine metabolites (cotinine, nicotine-N-oxide, and trans-3-hydroxycotinine) were the most important variables for distinguishing between the two groups.

A Support Vector Machine (SVM) [45] is a machine learning method that finds a maximum-margin boundary between classes; a kernel function implicitly maps the input features to a new feature space in which that boundary is linear. Regarding NMR studies, [46] used an SVM with a radial basis function kernel to classify cell extracts from normal and hepatocellular carcinoma cell lines, as well as the respective culture media. In [14], two supervised methods were combined: PLS was applied as a dimensionality reduction method and the resulting scores were used to train an SVM model to distinguish between patients with metastatic colorectal cancer and healthy individuals. In the same study, a PLS-SVM approach was also used to predict overall survival for the patients with metastatic colorectal cancer. In [47], SVM models were applied to MS data collected for sixteen diagnostic metabolites from lipid and fatty acid metabolism, allowing the identification of early-stage ovarian cancer patients.
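The role of the kernel can be illustrated with a short Python sketch that computes the radial basis function kernel, K(x, z) = exp(-gamma * ||x - z||^2), for every pair of samples. This shows only the similarity computation that an SVM relies on, not the training of the classifier itself; the profiles and the gamma value are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_matrix(samples, gamma=1.0):
    """Pairwise kernel values; an SVM is trained on this matrix, not on raw features."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]

# Hypothetical profiles: samples 0-1 from one class, 2-3 from another
profiles = [[1.0, 2.0], [1.1, 2.1], [3.0, 4.0], [3.1, 3.9]]
K = kernel_matrix(profiles, gamma=0.5)
for row in K:
    print([round(v, 3) for v in row])
```

Samples from the same class yield kernel values close to 1 and samples from different classes yield values close to 0. The gamma parameter controls how quickly the similarity decays with distance and is normally tuned by cross-validation.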

5 Case Studies

Specmine [48] is an R package for metabolomics data analysis, developed in our group, that allows users to perform the analyses described in the previous sections, among many others. To demonstrate its usefulness in cancer metabolomics studies based on NMR and MS techniques, two studies were reproduced using the specmine package. The fully detailed reports are available at http://darwin.di.uminho.pt/PACBB2018/metabolomics.

The first study [49] used 1H CPMG NMR to analyze the possible association of metabolism with the altered expression of the inositol 1,4,5-trisphosphate receptor (IP3R) in breast cancer, as this receptor is known to regulate metabolism and cellular bioenergetics and is upregulated in a number of cancers. Data for this analysis were obtained from the Metabolights website [50], under the study MTBLS152. The analysis included PCA and PLS-DA. Although there were some differences in the results, possibly because the available dataset differs slightly from the original file used by the authors, the specmine results confirm that PCA and PLS-DA were able to discriminate between samples with high and/or low expression of the gene that encodes inositol 1,4,5-trisphosphate receptor type 3 and healthy control samples.

The second study [11] used GC-MS to analyze the differences between the intracellular and extracellular metabolomic profiles of ovarian cancer cells (OCCs) and ovarian cancer stem cells (OCSCs). Data for this analysis were also obtained from Metabolights, under the study MTBLS152. The analysis included PCA and t-tests. Overall, the results obtained were very similar to those reported in the article. Some of the differences may arise because the study authors did not fully explain how the analysis was conducted, especially how they handled the fact that, in some cases, the same metabolite had different concentration levels for each sample.

6 Conclusions

Although the typical procedure in metabolomics data analysis usually involves PCA and PLS-DA/OPLS-DA analyses, most studies use a variety of data analysis methods that confirm and complement one another. Some recent cancer metabolomics studies have explored other machine learning techniques to build predictors based on NMR and/or MS data. These alternative predictors may be useful to build more robust classifiers and to extract biologically meaningful information from metabolomics data, such as identifying potential metabolic biomarkers. In the future, it would be interesting to see how these and other alternatives perform when compared to established methods.

Furthermore, the reproduction of two studies using the specmine package shows that this R package can be very useful in metabolomics data analysis, not only for univariate analysis but also for multivariate analysis, such as PCA and machine learning.