
1 Introduction

Ovarian cancer has a distinctive biology and behavior at the clinical, cellular, and molecular levels. It is the most prevalent and lethal female reproductive cancer, accounting for 5% of female cancer deaths. According to the National Cancer Institute, around 22,440 women will be diagnosed with the disease and 14,080 will die of it in 2017 [14]. The five-year overall survival rate of this disease is 46.5% when untreated, whereas early detection with proper treatment can increase the overall survival rate to 92.5%. Therefore, it is important to understand the role of biomarkers such as miRNAs and mRNAs in the various pathways of ovarian cancer.

MicroRNAs (miRNAs) are small non-coding RNAs of size \(\sim \) 22 nucleotides. A miRNA suppresses the expression of an mRNA by binding to the \(3^{'}\) untranslated region of that mRNA. miRNAs are found in many plants and animals, and extensive studies have been conducted to understand their role in different biological processes and diseases [1, 6, 10]. The role of miRNAs and their targets in ovarian cancer, however, has received little attention; only nine papers on this topic are available in PubMed. Further studies are therefore needed in order to develop diagnostic and prognostic tools for ovarian cancer. Existing methods usually use sequence data to identify miRNAs and their targets, but such methods are prone to high false-positive rates. A few works have therefore used miRNA and mRNA expression data instead. Some of them select miRNAs and mRNAs separately and then reconstruct regulatory modules from the correlation between the selected biomarkers. Others rely on regression, which requires more computational time. Hence, there is a need for a scalable approach to identify miRNA-mRNA regulatory modules in ovarian cancer.

This paper presents a framework for selecting important miRNA-mRNA regulatory modules in ovarian cancer. The modules are selected with the mutual information based maximum-relevance maximum-significance (MIMRMS) algorithm [13], which identifies a set of genes regulated by a particular miRNA. In the current study, the expression values of each miRNA are discretized and used as class labels, whereas the expression values of the genes are treated as features. The mutual information between two variables, here a miRNA and an mRNA, quantifies the interdependency between them. The MIMRMS algorithm selects a set of genes for a particular miRNA by maximizing both the relevance and the significance of the genes, so the selected set is both relevant and significant with respect to that miRNA. For each miRNA, a set of 50 mRNAs is selected in this way. The mRNAs of each module are then evaluated with the K-nearest neighbor classifier in order to reduce false positives. The effective mRNAs obtained represent a regulatory module, that is, a miRNA regulating a set of mRNAs. Next, to discard irrelevant modules, the statistical significance of each module is computed using the STRING database. Pathway enrichment analysis and disease ontology enrichment analysis reveal the importance of the selected modules with respect to ovarian cancer. The modules generated by MIMRMS are compared with those generated by the mRMR algorithm and by MatrixEQTL. The results show that the MIMRMS based approach generates more significant miRNA-mRNA regulatory modules for the ovarian cancer data.

2 Construction of miRNA-mRNA Modules

Automatic detection of miRNA-mRNA modules is important for understanding the underlying mechanism of the disease. This section describes the method used to identify miRNA-mRNA regulatory modules in ovarian cancer, which is based on the MIMRMS algorithm [12].

Given the miRNA expression matrix and the gene expression matrix, a decision matrix is created first. A decision matrix contains a class label attribute and conditional attributes. Here, the expression values of each miRNA are discretized and used as the class label, while the expression values of all genes are treated as features, or conditional attributes. The rows represent samples. In total, 175 decision matrices are created, each with 415 rows, 13,946 feature columns, and one class label column. For each miRNA, a set of mRNAs is selected by applying the MIMRMS algorithm. Next, the K-nearest neighbor algorithm is applied to the genes of each module to select an effective set of genes that yields high classification accuracy. Biologically, these can be interpreted as the genes that are regulated by that particular miRNA. The aim of this study is to select a set of relevant as well as significant genes that can be mapped to a miRNA, thereby generating a regulatory network that may play a role in the onset and progression of ovarian cancer. The existing MIMRMS algorithm and the K-nearest neighbor algorithm are described next.

2.1 The Gene Selection Algorithm

This section describes the existing MIMRMS algorithm [12] used in the current study. MIMRMS generates a set of mRNAs by maximizing both relevance and significance.

The MIMRMS algorithm selects a set of mRNAs \(\varTheta \) from a given microarray data set \({\mathbb C}=\{{\mathscr {G}}_1,\cdots , {\mathscr {G}}_i,\cdots ,{\mathscr {G}}_j, \cdots ,{\mathscr {G}}_m\}\) of m mRNAs. The relevance of an mRNA quantifies the correlation of the mRNA with the class label or miRNA; equivalently, it measures the dependency of the class label \({\mathbb M}\) on that attribute. Here, the relevance of the mRNA \({\mathscr {G}}_i\) with respect to the class label or miRNA \({\mathbb M}\) is denoted by \(\hat{f}({\mathscr {G}}_i,{\mathbb M})\), whereas \({\tilde{f}}({\mathscr {G}}_i,{\mathscr {G}}_j)\) denotes the significance of the mRNA \({\mathscr {G}}_j\) with respect to the mRNA \({\mathscr {G}}_i\). In this study, both relevance and significance are calculated using mutual information [11], as in [12].

The relevance \(\hat{f}({\mathscr {G}}_i,{\mathbb M})\) of an mRNA \({\mathscr {G}}_i\) with respect to the class label or miRNA \({\mathbb M}\) using mutual information can be computed as follows:

$$\begin{aligned} \hat{f}({\mathscr {G}}_i,{\mathbb M}) = I({\mathscr {G}}_i,{\mathbb M}), \end{aligned}$$
(1)

where \(I({\mathscr {G}}_i,{\mathbb M})\) represents the mutual information between attribute \({\mathscr {G}}_i\) and miRNA or class label \({\mathbb M}\) that is given by

$$\begin{aligned} I(\mathscr {G}_i, \mathbb {M})= H(\mathscr {G}_i) - H(\mathscr {G}_i \mid \mathbb {M}). \end{aligned}$$
(2)

Here, \(H(\mathscr {G}_i)\) and \(H(\mathscr {G}_i \mid \mathbb {M})\) represent the entropy of mRNA \(\mathscr {G}_i\) and the conditional entropy of \(\mathscr {G}_i\) given class label \(\mathbb {M}\), respectively. The entropy is a measure of uncertainty.
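As an illustration of Eqs. (1) and (2), the following minimal sketch computes entropy, conditional entropy, and mutual information from discretized profiles. The function names and the toy profiles are hypothetical and are not taken from the original study; entropies are measured in bits, which is an assumption of this sketch.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y) = sum over y of p(y) * H(X | Y = y)."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum((len(g) / n) * entropy(g) for g in groups.values())


def mutual_information(x, y):
    """I(X, Y) = H(X) - H(X | Y), as in Eq. (2)."""
    return entropy(x) - conditional_entropy(x, y)


# Hypothetical discretized profiles over eight samples (values in {-1, 0, 1}).
mirna = [1, 1, 0, 0, -1, -1, 0, 0]
gene_a = [-1, -1, 0, 0, 1, 1, 0, 0]  # fully determined by the miRNA state
gene_b = [0, 1, 0, 1, 0, 1, 0, 1]    # independent of the miRNA state

print(mutual_information(gene_a, mirna))  # 1.5
print(mutual_information(gene_b, mirna))  # 0.0
```

In this toy case the anti-correlated profile attains the maximum possible relevance, \(H(\mathscr {G}_i)=1.5\) bits, while the independent profile has zero relevance.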

Given a set of attributes, the individual contribution of an attribute to the dependency on the decision attribute can be computed with the significance criterion. The significance value of an attribute thus reflects its importance. Removing an attribute from the set of condition attributes changes the dependency value, and this change is the significance of the attribute. Its value ranges from 0 to 1: if the value is 0, the attribute is dispensable, and if it is 1, the attribute is indispensable.

Definition 1

Given \({\mathbb C},{\mathbb M}\) and an attribute \(\mathscr {G} \in {\mathbb C}\), the significance of the attribute \(\mathscr {G}\) is defined as [12]:

$$\begin{aligned} {\sigma _{\mathbb C}}({\mathbb M},\mathscr {G})=\hat{f}({\mathbb C},{\mathbb M})- \hat{f}({\mathbb C}-\{\mathscr {G}\},{\mathbb M}) \end{aligned}$$
(3)
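Definition 1 can be sketched as follows. The relevance of a set of attributes is computed here as the mutual information between the joint variable formed by the attributes and the class label; this joint-variable reading, the function names, and the toy data are assumptions of the sketch rather than details given in the text.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X, Y) = H(X) + H(Y) - H(X, Y), equivalent to Eq. (2)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def set_relevance(columns, labels):
    """Relevance of an attribute set; each sample's tuple of attribute
    values is treated as one joint discrete value (assumption)."""
    if not columns:
        return 0.0
    return mutual_information(list(zip(*columns)), labels)


def significance(columns, labels, index):
    """Eq. (3): change in relevance when attribute `index` is removed."""
    reduced = [c for k, c in enumerate(columns) if k != index]
    return set_relevance(columns, labels) - set_relevance(reduced, labels)


# Hypothetical toy attributes over four samples.
g1 = [0, 0, 1, 1]
g1_copy = [0, 0, 1, 1]
g3 = [0, 1, 0, 1]
xor_labels = [0, 1, 1, 0]  # determined only by g1 and g3 together

print(significance([g1, g3], xor_labels, 1))  # 1.0: g3 is indispensable
print(significance([g1, g1_copy], g1, 1))     # 0.0: the copy is dispensable
```

With a binary class label measured in bits, the two extremes of this toy example coincide with the values 1 and 0 discussed above.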

The total relevance of all selected mRNAs and total significance among the selected mRNAs are, therefore, given by

$$\begin{aligned} {\mathscr {J}}_\mathrm{relev}= \sum _{{\mathscr {G}}_i \in {\varTheta }} \hat{f}({\mathscr {G}}_i,{\mathbb M})~~~~~~~~~~~~~~~~ {\mathscr {J}}_\mathrm{signf}= \sum _{{\mathscr {G}}_i \ne {\mathscr {G}}_j \in \varTheta } {\tilde{f}}({\mathscr {G}}_i,{\mathscr {G}}_j). \end{aligned}$$
(4)

To identify a miRNA-mRNA module, a decision table is first created for each miRNA. The decision table contains the genes or mRNAs as conditional attributes and the miRNA as class label; the rows are samples. The MIMRMS algorithm is applied to each decision table to identify the genes or mRNAs associated with that particular miRNA. The MIMRMS process starts by initializing \({\mathbb C} \leftarrow \{{\mathscr {G}}_1,\cdots , {\mathscr {G}}_i,\cdots ,{\mathscr {G}}_j,\cdots ,{\mathscr {G}}_m\}, \varTheta \leftarrow \emptyset \). Next, it calculates the relevance \(\hat{f}({\mathscr {G}}_i,{\mathbb M})\) of each mRNA \({\mathscr {G}}_i \in {\mathbb C}\) with respect to the class label or miRNA. The mRNA \({\mathscr {G}}_i\) with the highest relevance value \(\hat{f}({\mathscr {G}}_i,{\mathbb M})\) is selected first, that is, \(\varTheta \leftarrow \varTheta \cup \{{\mathscr {G}}_i\}\) and \({\mathbb C} \leftarrow {\mathbb C} \setminus \{{\mathscr {G}}_i\}\). The algorithm then iteratively computes the significance of each remaining mRNA with respect to the already selected mRNAs and selects the mRNA \({\mathscr {G}}_j\) that maximizes the optimization function, so that \(\varTheta \leftarrow \varTheta \cup \{{\mathscr {G}}_j\}\) and \({\mathbb C} \leftarrow {\mathbb C} \setminus \{{\mathscr {G}}_j\}\). This step is repeated until the desired number of mRNAs has been selected for the corresponding miRNA or class label. The optimization function of the MIMRMS algorithm is

$$\begin{aligned} \hat{f}({\mathscr {G}}_j,{\mathbb M})+ \frac{1}{|\varTheta |} \sum _{{\mathscr {G}}_i \in \varTheta }{\tilde{f}}({\mathscr {G}}_i,{\mathscr {G}}_j). \end{aligned}$$
(5)
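The greedy procedure built around Eq. (5) can be sketched as below. The pairwise significance \({\tilde{f}}({\mathscr {G}}_i,{\mathscr {G}}_j)\) is taken here to be the gain in relevance when \({\mathscr {G}}_j\) is added to \({\mathscr {G}}_i\), that is, Eq. (3) applied to the pair \(\{{\mathscr {G}}_i,{\mathscr {G}}_j\}\). That reading, the function names, the alphabetical tie-breaking, and the toy data are assumptions of this sketch, not a reproduction of the implementation in [12].

```python
import math
from collections import Counter


def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X, Y) = H(X) + H(Y) - H(X, Y), equivalent to Eq. (2)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def pair_significance(g_i, g_j, labels):
    """Assumed pairwise form: I({g_i, g_j}, M) - I(g_i, M)."""
    joint = list(zip(g_i, g_j))
    return mutual_information(joint, labels) - mutual_information(g_i, labels)


def select_mrnas(genes, labels, d):
    """Greedily pick up to d mRNAs by maximizing Eq. (5).

    genes: dict mapping a gene name to its discretized expression list.
    labels: discretized expression of the miRNA (class label).
    """
    relevance = {g: mutual_information(col, labels) for g, col in genes.items()}
    candidates = set(genes)
    # Start with the most relevant mRNA; ties are broken by name (arbitrary).
    first = max(sorted(candidates), key=lambda g: relevance[g])
    selected = [first]
    candidates.remove(first)
    while candidates and len(selected) < d:
        def score(g):
            sig = sum(pair_significance(genes[s], genes[g], labels)
                      for s in selected)
            return relevance[g] + sig / len(selected)
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: gene_b duplicates gene_a, gene_c adds new information.
labels = [1, 1, 1, 0, 0, 0, -1, -1]
genes = {
    "gene_a": [1, 1, 1, 0, 0, 0, 0, 0],
    "gene_b": [1, 1, 1, 0, 0, 0, 0, 0],
    "gene_c": [0, 0, 0, 0, 0, 0, -1, -1],
}
print(select_mrnas(genes, labels, 2))  # ['gene_a', 'gene_c']
```

In this toy case gene_b is individually more relevant than gene_c, but it adds nothing once gene_a is selected, so the significance term favors gene_c.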

Mutual information is used to compute both the relevance and the significance of an mRNA, using (1) and (3), respectively.

The expression values of both miRNAs and mRNAs in microarray data are continuous. They must be discretized before mutual information can be used to calculate the relevance of an mRNA with respect to a miRNA or clinical outcome, because the marginal and joint probabilities are computed from the discretized expression values of the mRNA and the miRNA. These probabilities are then used to compute the mRNA-class/miRNA relevance. Discretization of the continuous valued miRNAs and mRNAs is therefore a vital step in the current study, and the discretization method described in [4] is used. This method discretizes the expression values of a miRNA or mRNA using the mean \(\mu \) and standard deviation \(\sigma \) computed over the n expression values of that particular miRNA or mRNA. Values larger than \((\mu + \sigma )\) are represented as 1, values between \((\mu - \sigma )\) and \((\mu + \sigma )\) as 0, and values smaller than \((\mu - \sigma )\) as \(-1\). These three values correspond to over-expression, baseline, and under-expression of the miRNA or mRNA, respectively.
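The three-state discretization can be sketched as follows. The text does not state whether the population or the sample standard deviation is used; the population form is assumed here, and the function name and input values are hypothetical.

```python
import statistics


def discretize(values):
    """Map continuous expression values to {-1, 0, 1} using mean +/- std.

    Assumption: sigma is the population standard deviation (pstdev).
    """
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    states = []
    for v in values:
        if v > mu + sigma:
            states.append(1)    # over-expression
        elif v < mu - sigma:
            states.append(-1)   # under-expression
        else:
            states.append(0)    # baseline
    return states


# Hypothetical expression profile with mean 5 and standard deviation 2.
profile = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(discretize(profile))  # [-1, 0, 0, 0, 0, 0, 0, 1]
```

Values lying exactly on a threshold, such as 7.0 in this example, are assigned to the baseline state, following the strict inequalities in the description above.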

2.2 K-Nearest Neighbor Rule

The K-nearest neighbor (K-NN) rule [5] is a classifier that is used here to evaluate the effectiveness of a reduced set of mRNAs. It classifies an unknown sample according to its closest training samples in the feature space: the sample is assigned to the class most common among its K nearest neighbors by majority vote. The value of K is chosen as the square root of the number of samples in the training set. In the current study, the mRNAs of each miRNA-mRNA module obtained using the MIMRMS algorithm are processed further. For each module, K-NN is applied to select the best mRNAs for the corresponding miRNA, and the mRNAs that yield the highest accuracy are retained. Biologically, it can be inferred that the mRNAs finally selected for a module are regulated by the miRNA of that module.
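A minimal sketch of the K-NN rule evaluated with leave-one-out cross-validation is given below. The Euclidean distance, the rounding of the square root to the nearest integer, the function names, and the toy samples are assumptions of this sketch, since the text does not specify them.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training samples closest to `query`."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(samples, labels):
    """Leave-one-out accuracy with K = round(sqrt(training set size))."""
    n = len(samples)
    correct = 0
    for held_out in range(n):
        train_x = samples[:held_out] + samples[held_out + 1:]
        train_y = labels[:held_out] + labels[held_out + 1:]
        k = max(1, round(math.sqrt(len(train_x))))
        if knn_predict(train_x, train_y, samples[held_out], k) == labels[held_out]:
            correct += 1
    return correct / n


# Hypothetical toy data: two well-separated groups of five samples each.
samples = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5],
    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0], [5.5, 5.5],
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(loocv_accuracy(samples, labels))  # 1.0
```

In the setting described above, each row of `samples` would hold the expression values of the candidate mRNAs of one module for one sample, and `labels` would hold the discretized state of the module's miRNA.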

3 Experimental Results

In the current study, the existing MIMRMS algorithm is used to identify regulatory modules. The fifty top-ranked mRNAs selected by the MIMRMS algorithm are used for further analysis. For MatrixEQTL, the top 50 mRNAs of each module are used directly for further analysis, as the ranking of the mRNAs was not assured. Within each module, the mRNAs are filtered further to reduce false positives: the prediction accuracy of the K-nearest neighbor (K-NN) rule with leave-one-out cross-validation (LOOCV) is computed for the mRNAs of each module. Finally, the obtained modules are evaluated using the STRING database [17], pathway enrichment analysis, and disease ontology. Both miRNA and mRNA expression data for serous ovarian cancer were downloaded through the Cancer Genomics Browser of UC Santa Cruz [3]. Both data sets contain exactly the same 415 samples, and the numbers of miRNAs and genes are 175 and 13,946, respectively. The effectiveness of the proposed approach is compared with the method of Huang and Cai [7] and with Matrix eQTL [15]. Huang and Cai used the minimum redundancy maximum relevance criterion [4] for selection of modules.

3.1 Selection of Significant Regulatory Modules

A total of 175 modules are generated by applying the MIMRMS algorithm and the K-NN rule. The leave-one-out accuracy of the modules varies from 0% to 48.43%. Next, the STRING database [17] is used to generate connections between the genes of each module, in order to check whether the genes of the obtained modules are involved in the same biological function. STRING stores protein-protein interaction information and uses experimentally validated connections, predictions, text mining, and other sources to create a connection between two genes or proteins. It also provides the statistical significance of a particular protein-protein interaction network (PPIN), quantified by a P-value.

Table 1 reports the total number of significant regulatory networks (P-value < 0.05) generated by the MatrixEQTL, mRMR, and MIMRMS algorithms. The table shows that all three algorithms generate significant PPI networks. However, only MatrixEQTL and MIMRMS generate networks with a very low P-value (P-value = 0): MatrixEQTL generates only one such network, whereas MIMRMS generates five. The details of all six modules (one from MatrixEQTL and five from MIMRMS) are presented in Table 2. Images of a few networks are provided in Fig. 1. The figure shows that the generated networks are highly inter-connected and compact, which also suggests that the MIMRMS based approach selects significant regulatory modules.

Table 1. Number of significant modules generated by MatrixEQTL, mRMR, and MIMRMS
Table 2. Description of most significant modules
Fig. 1. PPINs generated by STRING for MatrixEQTL and MIMRMS algorithms

3.2 Pathway Enrichment Analysis

For the biological interpretation of the highly significant modules (P-value = 0), the Cytoscape [16] plug-in ClueGO [2] has been used to perform pathway enrichment analysis on the genes of these modules. For the current analysis, the P-value threshold was set to 0.05 and the minimum number of genes associated with a term was set to 3. The WikiPathways database [9] has been used as the background database.

Fig. 2. Pathway enrichment analysis

Figure 2 shows the pathway terms obtained for MatrixEQTL and MIMRMS. The module selected by MatrixEQTL contains genes that are mainly associated with the process of protein synthesis, which indicates that they are housekeeping genes. In contrast, the MIMRMS algorithm generates modules whose members are more closely associated with pathways in cancer, although two of the MIMRMS modules also contain housekeeping genes. Terms obtained for the MIMRMS modules, such as miRNA targets in ECM and membrane receptors, and Senescence and Autophagy in Cancer, are cancer-associated pathways.

3.3 Disease Ontology Enrichment Analysis

The most significant networks (one from MatrixEQTL and five from MIMRMS) were analyzed further using disease ontology (DO) enrichment analysis with the R package DOSE [18]. This package identifies statistically significant disease ontology terms associated with a set of genes. Here, DO IDs are selected using a P-value threshold of 0.05. Table 3 lists the DO terms and their respective P-values. The table shows that MatrixEQTL does not generate any DO term relevant to ovarian cancer, whereas one of the MIMRMS modules generates a DO term that is highly relevant to ovarian cancer (bold text). This result indicates that the MIMRMS algorithm selects regulatory networks more effectively than the other existing methods. According to miR2Disease [8], all the miRNAs listed in Table 2 (for both MatrixEQTL and MIMRMS) are associated with ovarian cancer.

Table 3. Comparative analysis of association of modules with diseases

4 Conclusions

This paper presents an integrative approach for automatic detection of regulatory networks by applying the existing MIMRMS algorithm. The advantage of the MIMRMS algorithm over other existing algorithms is demonstrated in terms of the identification of miRNA-mRNA regulatory modules: it generates more significant regulatory modules that are highly related to ovarian cancer. The obtained regulatory modules may be helpful for understanding the underlying etiology of the disease.