DRPPP: A machine learning based tool for prediction of disease resistance proteins in plants

doi:10.1016/j.compbiomed.2016.09.008

Computers in Biology and Medicine

Volume 78, 1 November 2016, Pages 42-48

https://doi.org/10.1016/j.compbiomed.2016.09.008

Highlights

• Functionally validated disease resistance genes and a randomly generated negative dataset were used for model building.
• Almost all available important features were extracted and used for model building, making the model more robust.
• High prediction accuracy was observed on the test dataset, supporting the reliability of the model.
• The novel method is implemented in the Perl programming language and is freely available as a standalone tool.

Abstract

Plant disease outbreaks are increasing rapidly around the globe and are a major cause of crop loss worldwide. Plants, in turn, have developed diverse defense mechanisms to identify and evade different pathogenic microorganisms. Early identification of plant disease resistance genes (R genes) can be exploited in crop improvement programs. Present prediction methods rely either on sequence similarity and domain-based approaches or on electronically annotated sequences, and may therefore miss unrecognized or low-similarity proteins. There is thus an urgent need for a novel machine learning technique that addresses this problem.

In the current study, an SVM-based tool was developed for the prediction of disease resistance proteins in plants. All known disease resistance (R) proteins (112) were taken as the positive set, whereas the manually curated negative dataset consisted of 119 non-R proteins. Feature extraction generated 10,270 features using 16 different methods. Ten-fold cross-validation was performed to optimize the SVM parameters using a radial basis function kernel. The model was derived using libSVM and achieved an overall accuracy of 91.11% on the test dataset. The tool was found to be robust and can be used for high-throughput datasets. The current study enables rapid identification of R proteins using a machine learning approach, complementing similarity- and domain-based prediction methods.
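The parameter search summarized above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the authors' Perl/libSVM code: the function names `rbf_kernel`, `ten_fold_indices` and `parameter_grid`, the power-of-two grid ranges and the random seed are all hypothetical. Only the radial basis function kernel, the ten folds, and the dataset sizes (231 sequences in total, 45 of them held out for testing) come from the text.

```python
import math
import random
from itertools import product


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def parameter_grid(c_exponents, gamma_exponents):
    """Candidate (C, gamma) pairs on a power-of-two grid (an assumed grid)."""
    return [(2.0 ** c, 2.0 ** g)
            for c, g in product(c_exponents, gamma_exponents)]


# 231 sequences minus the 45 held-out test sequences leaves 186 for training.
folds = ten_fold_indices(186)
# Assumed search ranges; the source does not state the grid that was used.
grid = parameter_grid(range(-5, 16, 2), range(-15, 4, 2))
```

In a search of this kind, each (C, gamma) pair would be scored by training on nine folds and validating on the tenth, rotating through all ten folds, and the pair with the best mean validation accuracy would be retained for the final model.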

Introduction

Plants are under constant onslaught from pathogenic organisms including fungi, viruses, bacteria, nematodes and oomycetes [5], [7]. To perceive and counter such attacks, plants possess disease resistance (R) genes. Unlike animals, which have a defined immune system, plants have developed two defense systems for recognizing and responding to invading pathogens and pests. The first is basal defense, in which extracellular transmembrane receptors identify pathogen-associated molecular patterns (PAMPs), also known as microbe-associated molecular patterns (MAMPs) [13]. The second is based on the adaptive immune system and involves a defense layer consisting of a group of genes encoding NBS-LRR proteins and proteins associated with effector-triggered immunity (ETI).

Plant disease resistance proteins recognize the products of avirulence (Avr) genes expressed by pathogens [9]. Elicitor recognition occurs through R proteins and activates downstream signalling responses. Most plant R genes encode a nucleotide-binding site (NBS) and one or more leucine-rich repeat (LRR) domains and are referred to as NBS-LRR genes [12], [27], [9]. NBS proteins are further categorized into two sub-classes based on domains and motifs. NBS-LRR proteins with an N-terminal Toll/Interleukin-1 (TIR)-like domain are called TIR-NBS-LRR (TNL) proteins, while those lacking the TIR-like domain are called non-TIR-NBS-LRRs [8]. In non-TIR-NBS-LRRs, the TIR domain may be replaced by a coiled-coil (CC) domain, defining the CC-NBS-LRR proteins [1], [16]. The NBS domain is responsible for binding ATP [23], whereas the C-terminal LRR domain is involved in binding pathogen-derived molecules and in regulating signal transduction [16], [21]. The NBS domain contains the consensus motifs kinase 1a (P-loop), kinase 2 and kinase 3a, which are collectively referred to as the NB subdomain [2], [24]. Furthermore, R genes are functionally classified into seven types based on their domains: CNL (CC-NB-LRR), TNL (TIR-NB-LRR), NL (NBS-LRR), RLP (ser/thr-LRR), RLK (Kin-LRR), TN (TIR-NBS) and others [19].

Given the importance of disease resistance genes, a few computational resources for R genes have been developed in the past. The most popular resource, PRGdb (Plant Resistance Gene database), is a dedicated repository that hosts 112 reference R proteins and 104,335 putative R proteins from 233 plant species [19]. The 112 manually curated reference R proteins have been identified in Aegilops tauschii (1), Arabidopsis thaliana (25), Beta vulgaris (1), Capsicum annuum (2), Capsicum chacoense (1), Cucumis melo (4), Glycine max (3), Helianthus annuus (1), Hordeum vulgare (8), Lactuca sativa (1), Linum usitatissimum (3), Nicotiana benthamiana (2), Nicotiana glutinosa (1), Nicotiana tabacum (1), Oryza sativa (20), Phaseolus vulgaris (1), Solanum acaule (1), Solanum bulbocastanum (2), Solanum demissum (1), Solanum habrochaites (2), Solanum lycopersicum (13), Solanum pimpinellifolium (5), Solanum tuberosum (5), Triticum aestivum (5) and Zea mays (3). PRGdb's predictive pipeline, DRAGO, uses a domain-dependent approach for the prediction of R genes and classifies putative data into the CN, CNL, Mlo-like, N, NL, Other, RLK, RLK-GNK2, RLP, RPW8-NL, T, TN and TNL classes [19].

Early prediction of R proteins can accelerate genetic improvement and breeding programs for the development of disease-resistant varieties [15]. However, prediction is complex because plant disease resistance genes encoding NBS-LRR proteins are repetitive in nature. Several bioinformatics methods have previously been used to identify R proteins, including sequence alignment, BLAST search, phylogenetic analysis, and domain/motif analysis. These methods use applications such as Pfam (http://pfam.xfam.org/), Hidden Markov Model (HMM) [26], SMART (http://smart.embl-heidelberg.de/) [20], Prosite (http://prosite.expasy.org/), and InterProScan (http://www.ebi.ac.uk/Tools/pfa/iprscan5/). Existing methods are time-consuming and require extensive data processing. Recently, two pipelines, NLR-parser and NBSPred, were introduced for the identification of disease resistance proteins. NLR-parser uses motif alignment and search tool (MAST) output to identify the TNL or CNL class, whereas NBSPred is a machine learning pipeline for predicting plant resistance proteins [14], [22]. However, NLR-parser is not compatible with genomic assemblies, and NBSPred relies on electronically curated datasets, which is not an accurate basis for model building.

Present methods for predicting R proteins are mainly based on sequence similarity or domain analysis and may therefore miss unrecognized proteins. They can identify proteins that resemble known R proteins but have limited accuracy for sequences with low similarity to them.

To overcome this problem, there is an immediate need for alignment- and domain-independent methods, because methods that depend on high sequence similarity cannot predict certain low-similarity proteins. The support vector machine (SVM) [6], a supervised machine learning algorithm, has demonstrated high performance on classification and prediction problems in many biomedical fields, especially bioinformatics [17], [18], [28]. Furthermore, next-generation sequencing (NGS) technologies produce rapidly growing amounts of data. For early prediction of these important R proteins from large-scale genomic and transcriptomic datasets, there is an urgent need for a novel, robust and efficient computational method implemented as a tool.
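To make the idea of alignment- and domain-independent features concrete, the following Python sketch computes two widely used sequence descriptors, amino acid composition and dipeptide composition. This is an assumption for illustration only: this excerpt does not list the 16 feature extraction methods used by the authors, and the function names and the toy sequence below are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue (20 features)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return the fraction of each ordered residue pair (400 features)."""
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)
             if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    counts = Counter(pairs)
    return [counts[a + b] / len(pairs)
            for a in AMINO_ACIDS for b in AMINO_ACIDS]


# Toy sequence, not a real R protein.
toy_sequence = "MKLLV"
aac = amino_acid_composition(toy_sequence)
dpc = dipeptide_composition(toy_sequence)
```

Descriptors of this kind give every protein a fixed-length numeric vector regardless of its length or its similarity to known proteins, which is what allows a classifier such as an SVM to be trained without alignments.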

Machine learning based prediction of disease resistance proteins from plant genomes and transcriptomes can therefore provide rapid identification that does not depend on domain- or alignment-based methods. The present study implements a bioinformatics tool based on an SVM classifier for predicting R genes in plants, with the aim of providing a novel machine learning tool for the prediction of R proteins. The open access tool is available at http://14.139.240.55/NGS/download.php. The developed method was validated on a test set consisting of R proteins and non-R proteins and produced high accuracy and sensitivity.

Section snippets

Methods

A collection of 112 reference R proteins from 25 plant species, belonging to seven different domain classes, was downloaded from the PRGdb database [19] to serve as the positive dataset (Table 1). For the negative dataset, all protein sequences known to date from the same plant species were obtained from the NCBI protein database. From these sequences, 158 protein sequences were selected randomly using an in-house Perl script. These 158 random protein sequences were manually verified

Evaluation of test datasets

The performance of the model was evaluated using test datasets (sequences from the positive and negative datasets that were not used in training the model) consisting of 45 sequences, of which 22 were positive and 23 were negative. Evaluating the model on held-out test data is an important step for checking whether the model is overfitted. The data files from the test datasets were normalized to the range required by the SVM model, i.e. between 0 and 1. In order to compare the efficiency of the

Results

In total, 112 positive and 119 negative sequences were collected from PRGdb and the protein database maintained by the NCBI (National Center for Biotechnology Information), respectively. The distribution into training and test datasets is detailed in Table 2. The data were divided into about 80% for training and about 20% for testing the trained SVM model. A total of 10,270 features were generated using 16 different methods. These features were pre-processed to adjust NA

Discussion

Large-scale cultivation of commercial crops has declined in recent years owing to losses from biotic stresses caused by fungal, viral, bacterial and other pathogens and pests. R genes in plant genomes provide resistance against pathogens by translating into R proteins [10]. Identifying R genes can reduce reliance on chemical inputs and thereby contribute to environmental sustainability. Development of disease resistance has always been one of the major objectives for any crop improvement

Conclusions

The features that most influence the prediction of R genes are not completely known; therefore, a set of all important features was generated through feature extraction techniques. The developed method and its implementation, DRPPP, can efficiently predict R proteins with a higher accuracy (91.11%) than other existing tools. In future, DRPPP will be updated to include newly discovered R genes, thereby enhancing its efficiency.
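As a quick consistency check on the reported figure, the Python sketch below asks which counts of correct predictions on the 45-sequence test set (22 positive and 23 negative sequences, as described under Evaluation of test datasets) round to 91.11%. The function names are hypothetical, and the split of errors between positive and negative sequences is not given in this excerpt, so the sketch does not attempt to reconstruct it.

```python
def accuracy_percent(correct, total):
    """Accuracy as the percentage of correctly classified sequences."""
    return 100.0 * correct / total


def counts_matching(reported_percent, total, decimals=2):
    """List the correct-prediction counts that round to the reported value."""
    tolerance = 0.5 * 10 ** -decimals
    return [c for c in range(total + 1)
            if abs(accuracy_percent(c, total) - reported_percent) < tolerance]


TEST_SET_SIZE = 22 + 23  # positive + negative test sequences
matches = counts_matching(91.11, TEST_SET_SIZE)
print(matches)  # → [41]
```

Under these assumptions, the reported accuracy corresponds to 41 of 45 test sequences being classified correctly, since 41/45 ≈ 0.9111.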

Author contributions

TP, VJ and RSC conceptualized and derived this study. TP and VJ tested the models included in this study. RSC helped in drafting and finalizing the manuscript. All authors have contributed to, seen, and approved the manuscript.

Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgment

The authors are thankful to the Department of Biotechnology, Ministry of Science & Technology (Grant no. BT/01/CEIB/09/V/02), Government of India for providing funds to RSC in the form of a program support on high-value medicinal plants. The authors are thankful to the Jaypee University of Information Technology for providing the necessary research facilities.

References (28)

M. Albrecht et al., Update on the domain architectures of NLRs and R proteins, Biochem. Biophys. Res. Commun. (2006)
J. Ellis et al., Structure, function and evolution of plant disease resistance genes, Curr. Opin. Plant Biol. (2000)
X. Jingbo et al., Using the concept of pseudo amino acid composition to predict resistance gene against Xanthomonas oryzae pv. oryzae in rice: an approach from chaos games representation, J. Theor. Biol. (2011)
N.D. Young, The genetic architecture of resistance, Curr. Opin. Plant Biol. (2000)
A. Bendahmane et al., Constitutive gain-of-function mutants in a nucleotide binding site–leucine rich repeat protein encoded at the Rx locus of potato, Plant J. (2002)
M. Bermingham et al., Application of high-dimensional feature selection: Evaluation for genomic prediction in Man, Sci. Rep. (2015)
C.C. Chang et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST) (2011)
Z. Chen et al., The Pseudomonas syringae avrRpt2 gene product promotes pathogen virulence from inside plant cells, Mol. Plant-Microbe Interact. (2000)
C. Cortes et al., Support-vector networks, Mach. Learn. (1995)
J.L. Dangl et al., Plant pathogens and integrated defense responses to infection, Nature (2001)
B.J. DeYoung et al., Plant NBS-LRR proteins in pathogen sensing and host defense, Nat. Immunol. (2006)
S. Guidetti-Gonzalez et al., Putative resistance genes in the CitEST database, Genet. Mol. Biol. (2007)
D.A. Jones et al., The role of leucine-rich repeat proteins in plant defenses, Adv. Bot. Res. (1997)
J.D. Jones et al., The plant immune system, Nature (2006)
