Identifying microRNAs involved in cancer pathway using support vector machines

https://doi.org/10.1016/j.compbiolchem.2015.01.007

Highlights

  • Construction of a two-step SVM classifier for identifying miRNA associated with cancer.

  • Features are extracted from sequence, thermodynamics and miRNA–mRNA hybridization interactions based on experimentally validated data.

  • For miRSEQ, positions 1, 6, 10 and 19, together with the GG and CC repeats in the miRNA sequence, form the optimal feature subset.

  • For miRINT, the optimal features vary significantly with the number of seed regions formed in the hybrid.

  • The final classifier performed well, with cv-rates ranging from 87% to 92%.

Abstract

Since Ambros’ discovery of small non-protein coding RNAs in the early 1990s, the past two decades have seen an upsurge in the number of reports of predicted microRNAs (miRs), which have been implicated in various functions. The correlation of miRs with cancer has spurred the use of this class of non-coding RNAs in various cancer therapies, although most of these are still at the trial stage. However, experimentally establishing that a miR is associated with cancer remains an elaborate, time-consuming process. To aid this process, we undertook an in-silico study to identify global signatures in experimentally validated microRNAs associated with cancer. A support vector machine based two-step binary classifier system was then trained on the features extracted in that study. A total of 60 distinguishing features were selected and ranked to form the feature set for classification: 26 of these were extracted from the miR sequence itself, and the remainder from the thermodynamics of folding and the hybridized miRNA–mRNA structure. The two-step classifier model, miRSEQ and miRINT, performed reasonably well, with Matthews correlation coefficient (MCC) values ranging from 0.72 to 0.82 (availability: https://sites.google.com/site/sumitslab/tools).
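
For readers unfamiliar with the MCC reported above, the following is a minimal sketch of how the coefficient is computed from the four counts of a binary confusion matrix. The function name `matthews_corrcoef` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    adopted here as an assumption).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only; these are NOT results from the paper.
example = matthews_corrcoef(tp=45, tn=40, fp=5, fn=10)
print(f"MCC = {example:.2f}")
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and it uses all four cells of the confusion matrix.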

Introduction

miRNAs (miRs) are small, non-coding, single-stranded RNAs (about 22 nucleotides in length) involved in several regulatory pathways in the cell cycle. They bind to the untranslated regions (UTRs) of mRNA, particularly the 3'UTR, and play an important role in the post-transcriptional regulation of gene expression (Bartel, 2004, Filipowicz et al., 2008). Recent studies suggest that these non-coding RNAs can also bind to 5'UTRs (Ragan et al., 2009) and coding regions (Hausser et al., 2013) of mRNA, but little is known about the mechanism of binding and their regulation. Binding of a miR to a specific target in a UTR with complete complementarity either leads to degradation of the mRNA itself or induces translational repression (Esquela-Kerscher and Slack, 2006). In tissues associated with various tumors, the expression pattern of miRs has been observed to be altered considerably (Cummins et al., 2006, Zhang et al., 2006). Additionally, gene mapping reveals that most human miRs are located in chromosomal positions that are susceptible to rearrangements (Calin and Croce, 2007). Hence, it can be asserted that miRs in humans play a major role in the cancer pathway.

Previous studies by several authors have investigated the involvement of different types of base pairing in miR–mRNA interactions, and target prediction algorithms have been formulated on these principles. These algorithms predominantly considered Watson–Crick base pairing between the miR and its respective mRNA, especially at the 2nd to the 8th nucleotide positions of the miR, to identify potential target sites. However, later studies found that animal miRs do not bind to mRNA with perfect complementarity (unlike in plants); rather, their binding leaves several imperfections such as loops, mismatches or bulges and often involves GU (non-Watson–Crick) base pairing as well (Axtell et al., 2011, Didiano and Hobert, 2008). Besides these determinants, AU richness around the seed regions and folding of the mRNA play a vital role in target binding (Grimson et al., 2007, Robins et al., 2005). All these factors need to be considered together, not in isolation, to hypothesize miR:mRNA interactions.
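
The pairing rules described above can be illustrated with a minimal sketch that classifies each position of the seed region (miR positions 2 to 8) as a Watson–Crick pair, a GU wobble pair or a mismatch. The function name `seed_pairing`, the convention that the target strand is written 3'→5' so that its i-th base faces miR position i, and the example sequences are all assumptions made for illustration; this is not the feature extraction used by the authors, and it ignores loops and bulges.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def seed_pairing(mir: str, target_3to5: str, start: int = 2, end: int = 8) -> dict:
    """Count pair types over miR positions start..end (1-based, inclusive).

    `mir` is read 5'->3'; `target_3to5` is the mRNA site read 3'->5',
    aligned so that index i of each string faces the other (assumption).
    """
    mir = mir.upper().replace("T", "U")
    target = target_3to5.upper().replace("T", "U")
    if len(mir) < end or len(target) < end:
        raise ValueError("both sequences must cover the seed region")

    counts = {"watson_crick": 0, "wobble": 0, "mismatch": 0}
    for pos in range(start, end + 1):
        pair = (mir[pos - 1], target[pos - 1])
        if pair in WATSON_CRICK:
            counts["watson_crick"] += 1
        elif pair in WOBBLE:
            counts["wobble"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# Illustrative sequences: the target carries a U opposite the G at miR position 2.
counts = seed_pairing("UGAGGUAGUAGGUUGUAUAGUU", "AUUCCAUC")
print(counts)
```

Counts of this kind would be only one ingredient; as the paragraph notes, AU richness around the seed and mRNA folding also need to be taken into account.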

Some of the computational methods used in the functional annotation of miRs involved in cancer rely mainly on the expression profiles of various cancer cell types and on statistical analysis for further classification (Jayaswal et al., 2011). These methods utilize the expression profile, but they fail to consider that a single miR can bind to several mRNA target sites and regulate the cell differently. Our aim in feature selection was, therefore, to account for this redundancy. Other attempts to classify miRs into oncogenes and tumor suppressor genes (TSGs) were based on functional and evolutionary features (Wang et al., 2010) such as conservation, expression levels and chromosome distribution.

The present study involved a search for and analysis of features involved in miR:mRNA interactions associated with cancer. These features encompassed the sequence, hybridization and thermodynamics of validated miR:mRNA interactions only. Based on the curated and prioritized features, we developed a two-step support vector machine based classifier model, miRSEQ and miRINT, which identifies whether a miR is associated with cancer and also classifies the type of its association, i.e., with either an oncogene or a tumor suppressor. Prioritizing the features and diversifying the models according to the number of seed regions drastically improved the performance of the classifier compared with generalized features and holistic hybridization. The incorporation of seed-based classification in the determination of features is a novel approach in our algorithm. The final classifier performed well on experimentally validated datasets, giving good prediction accuracy (cross-validation rate (cv-rate) ranging from 87% to 92%).

Dataset preparation

To generate a classifier, the first step is to construct a dataset of microRNAs that have been experimentally validated as associated with cancer. To begin with, a list of genes involved in cancer was downloaded from the Catalogue of Somatic Mutations in Cancer (COSMIC) (Higgins et al., 2007). A total of 488 genes were thus listed, which could be further segregated into oncogenes and tumor suppressors by cross-referring with the tumor associated gene database

Results

Dataset preparation was carried out individually for the classifiers miRSEQ and miRINT (Fig. 1). Consequently, a total of 263 miRs were used in the miRSEQ training. The class imbalance problem in the dataset was overcome with the SMOTE method (k-nearest neighbour algorithm with no replacement), which generated a sufficient number of negative instances for the training set. As in most SVM classification problems related to miRNAs, our dataset was too complex to be linearly separable. RBF was
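
The oversampling idea behind SMOTE (Chawla et al., 2002) can be sketched as follows: each synthetic instance is placed at a random point on the line segment between an existing instance of the under-represented class and one of its k nearest neighbours from the same class. The function name `smote`, its parameters and the toy data are assumptions for illustration; the sketch does not reproduce the authors' settings (for example, the "no replacement" option) or their feature vectors.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def smote(minority: List[Vector], n_new: int, k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic samples by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two samples of the under-represented class")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other samples of the same class by Euclidean distance to `base`.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate at a random fraction of the way from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy two-dimensional feature vectors (illustrative only).
toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(toy, n_new=10, k=2)
print(len(new_points))
```

Because every synthetic point lies between two real instances of the same class, the method enlarges that class without simply duplicating existing rows.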

Discussion

Identifying miR involvement in cancer is a major obstacle for researchers striving to understand the basis of the disease and to generate new therapies against particular cancer types. miRNAs regulate the molecular pathways in cancer by either upregulating or downregulating various oncogenes and tumor suppressors, and sometimes by acting as oncogenes themselves. The functional annotation of miRNAs in cancer is still a painstaking process, though cancer therapies using miRNA have been picking up

Acknowledgements

The authors wish to thank Dr. Ranjit Prasad Bahadur, Indian Institute of Technology – Kharagpur, India, for his initial assistance in machine learning approaches. Ram K. was supported by a scholarship from the Council of Scientific and Industrial Research, Govt. of India.

References (40)

  • N.V. Chawla et al.

    SMOTE: synthetic minority over-sampling technique

    J. Artif. Intell. Res.

    (2002)
  • J.S. Chen et al.

    In silico identification of oncogenic potential of fyn-related kinase in hepatocellular carcinoma

    Bioinformatics

    (2013)
  • J.M. Cummins et al.

    The colorectal microRNAome

    PNAS

    (2006)
  • D. Didiano et al.

    Molecular architecture of a miRNA-regulated 3′ UTR

    (2008)
  • A. Esquela-Kerscher et al.

    Oncomirs – microRNAs with a role in cancer

    Nat. Rev. Cancer

    (2006)
  • W. Filipowicz et al.

    Mechanisms of post-transcriptional regulation by microRNAs: are the answers in sight?

    Nat. Rev. Genet.

    (2008)
  • S. Griffiths-Jones et al.

    miRBase: microRNA sequences, targets and gene nomenclature

    Nucleic Acids Res.

    (2006)
  • J. Hausser et al.

    Analysis of CDS-located miRNA target sites suggests that they can effectively inhibit translation

    Genome Res.

    (2013)
  • C. Hebert et al.

    High mobility group A2 is a target for miRNA-98 in head and neck squamous cell carcinoma

    Mol. Cancer

    (2007)
  • M.E. Higgins et al.

    CancerGenes: a gene selection resource for cancer genome projects

    Nucleic Acids Res.

    (2007)