Elsevier

Computers in Biology and Medicine

Volume 78, 1 November 2016, Pages 42-48

DRPPP: A machine learning based tool for prediction of disease resistance proteins in plants

https://doi.org/10.1016/j.compbiomed.2016.09.008

Highlights

  • Functionally validated disease resistance genes and a randomly generated negative dataset were used for model building.

  • Almost all available important features were extracted and used for model building, making the model more robust.

  • High prediction accuracy of the model was observed on the test dataset justifying its high reliability.

  • The novel method is implemented in the Perl programming language and is freely available as a standalone tool.

Abstract

Plant disease outbreaks are increasing rapidly around the globe and are a major cause of crop loss worldwide. Plants, in turn, have developed diverse defense mechanisms to identify and evade different pathogenic microorganisms. Early identification of plant disease resistance genes (R genes) can be exploited in crop improvement programs. Present prediction methods rely either on sequence similarity and domain-based approaches or on electronically annotated sequences, and may therefore miss unrecognized or low-similarity proteins. Therefore, there is an urgent need for a novel machine learning technique to address this problem.

In the current study, an SVM-based tool was developed for the prediction of disease resistance proteins in plants. All known disease resistance (R) proteins (112) were taken as the positive set, whereas the manually curated negative dataset consisted of 119 non-R proteins. Feature extraction generated 10,270 features using 16 different methods. Ten-fold cross-validation was performed to optimize the SVM parameters using a radial basis function kernel. The model was derived using libSVM and achieved an overall accuracy of 91.11% on the test dataset. The tool was found to be robust and can be used for high-throughput datasets. The current study provides rapid identification of R proteins using a machine learning approach, complementing similarity- and domain-based prediction methods.

Introduction

Plants are under constant onslaught from different pathogenic microorganisms including fungi, viruses, bacteria, nematodes and oomycetes [5], [7]. To perceive such disease-related attacks, plants possess disease resistance (R) genes, which have the ability to counterattack these pathogens. Unlike animals, which have a defined immune system, plants have developed two defense systems for recognition of and response to invading pathogens and pests. The first mechanism is basal defense, where extracellular transmembrane receptors identify pathogen-associated molecular patterns (PAMPs), also known as microbe-associated molecular patterns (MAMPs) [13]. The second mechanism is based on the adaptive immune system, which involves a defense layer consisting of a group of genes encoding NBS-LRR proteins and proteins associated with effector-triggered immunity (ETI).

Plant disease resistance proteins recognize avirulence (Avr) genes expressed by the pathogens [9]. Elicitor recognition occurs through R proteins, thereby activating downstream signalling responses. Most R genes in plants consist of a nucleotide binding site (NBS) and leucine-rich repeat (LRR) domain(s) and are referred to as NBS-LRR genes [12], [27], [9]. NBS proteins are further categorized into two sub-classes based on domains and motifs. NBS-LRR proteins with an N-terminal Toll/Interleukin-1 (TIR)-like domain are called TIR-NBS-LRR (TNL) proteins, while those lacking the TIR-like domain are called non-TIR-NBS-LRRs [8]. In non-TIR-NBS-LRRs, the TIR domain may be substituted by a coiled-coil (CC) domain, defining CC-NBS-LRR proteins [1], [16]. The NBS domain is responsible for binding to ATP [23], whereas the C-terminal leucine-rich repeat (LRR) is concerned with binding to pathogen-derived molecules and regulation of signal transduction [16], [21]. The NBS domain contains the consensus motifs kinase 1a (P-loop), kinase 2 and kinase 3a, which are collectively referred to as the NB subdomain [2], [24]. Further, R genes are functionally classified into seven different types based on domains: CNL (CC-NB-LRR), TNL (TIR-NB-LRR), NL (NBS-LRR), RLP (ser/thr-LRR), RLK (Kin-LRR), TN (TIR-NBS) and others [19].

Considering the importance of disease resistance genes, a few computational resources for R genes have been developed in the past. The most popular resource, PRGdb (Plant Resistance Gene database), is a dedicated repository that hosts 112 reference R proteins and 104,335 putative R proteins in 233 plant species [19]. The 112 manually curated reference R proteins have been identified from Aegilops tauschii (1), Arabidopsis thaliana (25), Beta vulgaris (1), Capsicum annuum (2), Capsicum chacoense (1), Cucumis melo (4), Glycine max (3), Helianthus annuus (1), Hordeum vulgare (8), Lactuca sativa (1), Linum usitatissimum (3), Nicotiana benthamiana (2), Nicotiana glutinosa (1), Nicotiana tabacum (1), Oryza sativa (20), Phaseolus vulgaris (1), Solanum acaule (1), Solanum bulbocastanum (2), Solanum demissum (1), Solanum habrochaites (2), Solanum lycopersicum (13), Solanum pimpinellifolium (5), Solanum tuberosum (5), Triticum aestivum (5) and Zea mays (3). PRGdb's predictive pipeline, DRAGO, uses a domain-dependent approach for prediction of R genes, which classifies putative data into the CN, CNL, Mlo-like, N, NL, Other, RLK, RLK-GNK2, RLP, RPW8-NL, T, TN and TNL classes [19].

Early prediction of R proteins can accelerate genetic improvement and breeding programs for the development of disease-resistant varieties [15]. However, prediction is complex because plant disease resistance genes encoding NBS-LRR proteins are repetitive in nature. Several bioinformatics methods have previously been used for identifying R proteins, including sequence alignment, BLAST search, phylogenetic analysis, and domain/motif analysis. These methods use various applications such as Pfam (http://pfam.xfam.org/), Hidden Markov Model (HMM) [26], SMART (http://smart.embl-heidelberg.de/) [20], Prosite (http://prosite.expasy.org/), and InterProScan (http://www.ebi.ac.uk/Tools/pfa/iprscan5/). Existing methods are time-consuming and require extensive data processing capability. Recently, two pipelines, NLR-parser and NBSPred, were introduced for identification of disease resistance proteins. NLR-parser uses motif alignment and search tool (MAST) output to identify the TNL or CNL class, whereas NBSPred is a machine learning pipeline for predicting plant resistance proteins [14], [22]. However, NLR-parser is incompatible with genomic assemblies, whereas NBSPred relies on electronically curated datasets, which is not an accurate basis for model building.

Present methods for predicting R proteins are mainly based on sequence similarity or domain analysis, and may therefore miss some existing unrecognized proteins. They can predict R proteins that are similar to known R proteins, but have limited accuracy for sequences with low similarity to them.

Because the above methods depend on high sequence similarity, certain low-similarity proteins cannot be predicted with them. To overcome this problem, alignment- and domain-independent methods are needed. A supervised machine learning method, the support vector machine (SVM) algorithm [6], has demonstrated high performance in prediction and classification problems in many biomedical fields, especially bioinformatics [17], [18], [28]. Further, next-generation sequencing (NGS) technology can produce burgeoning amounts of data. For early prediction of these important R proteins from large-scale genomic and transcriptome datasets, there is an urgent need to develop a novel, robust and efficient computational method implemented as a tool.
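The abstract names a radial basis function (RBF) kernel and ten-fold cross-validation. As a minimal illustration only, the sketch below shows the standard RBF kernel form, K(x, y) = exp(-gamma * ||x - y||^2), and one simple way of partitioning sample indices into ten folds. The function names, the toy vectors and the choice of contiguous folds are assumptions made for this example, not details taken from the study.

```python
import math

def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Toy feature vectors (hypothetical values, already scaled to [0, 1]).
u = [0.2, 0.8, 0.5]
v = [0.1, 0.9, 0.5]
similarity = rbf_kernel(u, v, gamma=0.5)

# Ten folds over 25 toy samples; each fold serves once as the validation set.
folds = k_fold_indices(25, k=10)
```

In practice the samples would be shuffled before folding, and cross-validated accuracy would be compared across candidate values of the kernel width (gamma) and the soft-margin cost to choose the final parameters.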

Machine learning based prediction of disease resistance proteins from plant genomes and transcriptomes can, therefore, provide rapid identification that does not depend on domain- or alignment-based methods. The present study was implemented as a bioinformatics tool based on an SVM classifier for predicting R genes in plants. The open access tool is available at http://14.139.240.55/NGS/download.php. The current research aims to provide a novel machine learning tool for prediction of R proteins in plants. The developed method was validated on a test set consisting of R proteins and non-R proteins and produced high accuracy and sensitivity.

Methods

A collection of 112 reference R proteins from 25 plant species, belonging to seven different domain classes (Table 1), was downloaded from the PRGdb database [19] to serve as the positive dataset. For the negative dataset, all protein sequences known to date from the same plant species were obtained from the NCBI protein database. Out of these sequences, 158 protein sequences were selected randomly using an in-house Perl script. These 158 random protein sequences were manually verified
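The random selection step described above can be sketched as follows. This is a minimal Python illustration, not the authors' in-house Perl script: the FASTA parsing, the function names, the fixed seed and the toy input are all assumptions made for the example.

```python
import random

def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def sample_negatives(records, n, seed=0):
    """Draw n distinct records uniformly at random (the seed is a hypothetical choice)."""
    rng = random.Random(seed)
    return rng.sample(records, n)

# Toy input standing in for the NCBI download; the real pool would be far larger.
toy = ">p1\nMKV\nLLA\n>p2\nMSTN\n>p3\nMAAG\n>p4\nMKKL\n"
pool = read_fasta(toy)
picked = sample_negatives(pool, 2)
```

Sampling without replacement guarantees that no sequence is selected twice, and fixing the seed makes the draw reproducible. With the real pool, `n` would be 158.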

Evaluation of test datasets

The performance of the model was evaluated using test datasets (sequences from the positive and negative datasets not used in training the model) consisting of 45 sequences: 22 positive and 23 negative. Evaluating the model on held-out test data is important for checking whether it has overfitted. The data files from the test datasets were normalized according to the SVM model requirements, i.e. between 0 and 1. In order to compare the efficiency of the
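The normalization and accuracy calculation mentioned above can be sketched as follows. The example assumes per-feature min-max scaling with ranges learned from the training data; the text does not say how the ranges were obtained, so this is an assumption, as are the function names and toy values.

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, ranges):
    """Scale each column with the given ranges; constant columns map to 0.0."""
    scaled = []
    for row in rows:
        scaled.append([
            0.0 if hi == lo else (value - lo) / (hi - lo)
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy training matrix with two features (hypothetical values).
train = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
ranges = fit_min_max(train)
scaled_train = apply_min_max(train, ranges)

# Arithmetic check: 41 correct out of 22 + 23 = 45 test sequences rounds to
# the 91.11% reported in the abstract (40/45 = 88.89%, 42/45 = 93.33%).
reported_accuracy = round(100 * 41 / 45, 2)
```

With training-derived ranges, a test value outside the training range scales outside [0, 1]. The reported overall accuracy fixes only the total number of correct predictions, not how the errors are divided between the positive and negative classes.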

Results

In total, 112 positive and 119 negative sequences were collected from PRGdb and the protein database maintained by the NCBI (National Center for Biotechnology Information), respectively. The detailed distribution into training and test datasets is presented in Table 2. The data were divided into about 80% for training and about 20% for testing the trained SVM model. A total of 10,270 features were generated using 16 different methods. These features were pre-processed to adjust NA
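The 80/20 division can be illustrated with a per-class split. The sketch below is an assumption about how such a split might be done, not the procedure given in Table 2. Flooring 20% of each class happens to reproduce the 22 positive and 23 negative test sequences reported in the evaluation section, which would leave 90 positive and 96 negative sequences for training if the test set is drawn from the same 231 sequences. The identifiers and seed are hypothetical.

```python
import random

def stratified_split(items_by_class, test_percent=20, seed=0):
    """Hold out floor(test_percent %) of each class for testing; the rest is training."""
    rng = random.Random(seed)
    train, test = [], []
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_test = len(shuffled) * test_percent // 100
        test += [(item, label) for item in shuffled[:n_test]]
        train += [(item, label) for item in shuffled[n_test:]]
    return train, test

# Placeholder identifiers standing in for the 112 R and 119 non-R sequences.
data = {
    1: [f"R_{i}" for i in range(112)],
    0: [f"nonR_{i}" for i in range(119)],
}
train, test = stratified_split(data)

# 45 of 231 sequences is roughly 19.5%, i.e. "about 20%".
test_share = round(100 * len(test) / (len(train) + len(test)), 1)
```

Splitting within each class keeps the ratio of positive to negative sequences nearly identical in the training and test sets.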

Discussion

Large-scale cultivation of commercial crops has declined in recent years owing to losses from various biotic stresses caused by fungal, viral, bacterial and other pests. R genes in plant genomes provide resistance against pathogens by translating into R proteins [10]. Identifying R genes can reduce reliance on chemical inputs, improving environmental sustainability. Development of disease resistance has always been one of the major objectives for any crop improvement

Conclusions

The influential features for predicting R genes are not completely known; therefore, a set of all important features was generated through feature extraction techniques. The developed method and its implementation as a tool, DRPPP, can efficiently predict R proteins with the highest accuracy (91.11%) compared with other existing tools. In the future, DRPPP will be updated to include newly discovered R genes, thereby improving its performance.

Author contributions

TP, VJ and RSC conceptualized and derived this study. TP and VJ tested the models included in this study. RSC helped in drafting and finalizing the manuscript. All authors have contributed to, seen, and approved the manuscript.

Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgment

The authors are thankful to the Department of Biotechnology, Ministry of Science & Technology (Grant no. BT/01/CEIB/09/V/02), Government of India for providing funds to RSC in the form of a program support on high-value medicinal plants. The authors are thankful to the Jaypee University of Information Technology for providing the necessary research facilities.

References (28)

  • B.J. DeYoung et al. Plant NBS-LRR proteins in pathogen sensing and host defense. Nat. Immunol. (2006)
  • S. Guidetti-Gonzalez et al. Putative resistance genes in the CitEST database. Genet. Mol. Biol. (2007)
  • D.A. Jones et al. The role of leucine-rich repeat proteins in plant defenses. Adv. Bot. Res. (1997)
  • J.D. Jones et al. The plant immune system. Nature (2006)