DRPPP: A machine learning based tool for prediction of disease resistance proteins in plants
Introduction
Plants are under constant attack from pathogenic microorganisms, including fungi, viruses, bacteria, nematodes and oomycetes [5], [7]. To perceive such attacks, plants possess disease resistance (R) genes, which enable them to counter these pathogens. Unlike animals, which have a defined immune system, plants have developed two defense systems for recognizing and responding to invading pathogens and pests. The first mechanism is basal defense, in which extracellular transmembrane receptors identify pathogen-associated molecular patterns (PAMPs), also known as microbe-associated molecular patterns (MAMPs) [13]. The second mechanism is based on the adaptive immune system, which involves a defense layer consisting of a group of genes encoding NBS-LRR proteins and proteins associated with effector-triggered immunity (ETI).
Plant disease resistance proteins recognize the products of avirulence (Avr) genes expressed by pathogens [9]. Elicitor recognition occurs through R proteins, thereby activating downstream signalling responses. Most R genes in plants encode a nucleotide binding site (NBS) and leucine-rich repeat (LRR) domain(s) and are referred to as NBS-LRR genes [12], [27], [9]. NBS proteins are further categorized into two sub-classes based on domains and motifs. NBS-LRR proteins with an N-terminal Toll/Interleukin1 (TIR)-like domain are called TIR-NBS-LRR (TNL) proteins, while those lacking the TIR-like domain are called non-TIR-NBS-LRRs [8]. In non-TIR-NBS-LRRs, the TIR domain may be substituted by a coiled-coil (CC) domain, defining CC-NBS-LRR proteins [1], [16]. The NBS domain is responsible for binding ATP [23], whereas the C-terminal LRR domain is involved in binding pathogen-derived molecules and regulating signal transduction [16], [21]. The NBS domain contains the consensus motifs kinase 1a (P-loop), kinase 2 and kinase 3a, which are collectively referred to as the NB subdomain [2], [24]. Further, R genes are functionally classified into seven types based on their domains: CNL (CC-NB-LRR), TNL (TIR-NB-LRR), NL (NBS-LRR), RLP (ser/thr-LRR), RLK (Kin-LRR), TN (TIR-NBS) and others [19].
Considering the importance of disease resistance genes, a few computational resources for R genes have been developed in the past. The most popular resource, PRGdb (Plant Resistance Gene database), is a dedicated repository that hosts 112 reference R proteins and 104,335 putative R proteins from 233 plant species [19]. The 112 manually curated reference R proteins have been identified from Aegilops tauschii (1), Arabidopsis thaliana (25), Beta vulgaris (1), Capsicum annuum (2), Capsicum chacoense (1), Cucumis melo (4), Glycine max (3), Helianthus annuus (1), Hordeum vulgare (8), Lactuca sativa (1), Linum usitatissimum (3), Nicotiana benthamiana (2), Nicotiana glutinosa (1), Nicotiana tabacum (1), Oryza sativa (20), Phaseolus vulgaris (1), Solanum acaule (1), Solanum bulbocastanum (2), Solanum demissum (1), Solanum habrochaites (2), Solanum lycopersicum (13), Solanum pimpinellifolium (5), Solanum tuberosum (5), Triticum aestivum (5) and Zea mays (3). PRGdb's predictive pipeline, DRAGO, uses a domain-dependent approach for predicting R genes, which classifies putative data into the CN, CNL, Mlo-like, N, NL, Other, RLK, RLK-GNK2, RLP, RPW8-NL, T, TN and TNL classes [19].
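As a quick consistency check on the per-species counts listed above, the short Python sketch below transcribes them into a dictionary and totals them. The variable names are illustrative only; the numbers are copied from the paragraph, not from PRGdb itself.

```python
# Reference R-protein counts per species, transcribed from the paragraph above.
reference_counts = {
    "Aegilops tauschii": 1,
    "Arabidopsis thaliana": 25,
    "Beta vulgaris": 1,
    "Capsicum annuum": 2,
    "Capsicum chacoense": 1,
    "Cucumis melo": 4,
    "Glycine max": 3,
    "Helianthus annuus": 1,
    "Hordeum vulgare": 8,
    "Lactuca sativa": 1,
    "Linum usitatissimum": 3,
    "Nicotiana benthamiana": 2,
    "Nicotiana glutinosa": 1,
    "Nicotiana tabacum": 1,
    "Oryza sativa": 20,
    "Phaseolus vulgaris": 1,
    "Solanum acaule": 1,
    "Solanum bulbocastanum": 2,
    "Solanum demissum": 1,
    "Solanum habrochaites": 2,
    "Solanum lycopersicum": 13,
    "Solanum pimpinellifolium": 5,
    "Solanum tuberosum": 5,
    "Triticum aestivum": 5,
    "Zea mays": 3,
}

n_species = len(reference_counts)
n_proteins = sum(reference_counts.values())
print(n_species, n_proteins)  # 25 112
```

The totals agree with the 112 reference proteins from 25 species quoted elsewhere in the text.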
Early prediction of R proteins can accelerate genetic improvement and breeding programs for the development of disease-resistant varieties [15]. However, prediction is complex because plant disease resistance genes encoding NBS-LRR proteins are repetitive in nature. Several bioinformatics methods have previously been used to identify R proteins, including sequence alignment, BLAST search, phylogenetic analysis, and domain/motif analysis. These methods rely on resources such as Pfam (http://pfam.xfam.org/), Hidden Markov Models (HMM) [26], SMART (http://smart.embl-heidelberg.de/) [20], Prosite (http://prosite.expasy.org/), and InterProScan (http://www.ebi.ac.uk/Tools/pfa/iprscan5/). Existing methods are time-consuming and require extensive data processing capability. Recently, two pipelines, NLR-parser and NBSPred, were introduced for the identification of disease resistance proteins. NLR-parser uses motif alignment and search tool (MAST) output to identify the TNL or CNL class, whereas NBSPred is a machine learning pipeline for predicting plant resistance proteins [14], [22]. However, NLR-parser is incompatible with genomic assemblies, and NBSPred relies on electronically curated datasets, which is not an accurate basis for model building.
Present methods for predicting R proteins are mainly based on sequence similarity or domain analysis, and may therefore miss R proteins that have not yet been recognized. They can predict R proteins that resemble known R proteins, but their accuracy is limited for sequences with low similarity to the known R proteins.
To overcome this problem, there is an immediate need for alignment- and domain-independent methods that do not depend on high sequence similarity. A supervised machine learning method, the support vector machine (SVM) algorithm [6], has demonstrated high performance in solving classification and prediction problems in many biomedical fields, especially bioinformatics [17], [18], [28]. Further, next-generation sequencing (NGS) technology produces a rapidly growing amount of data. For early prediction of these important R proteins from large-scale genomic and transcriptomic datasets, there is an urgent need for a novel, robust and efficient computational method implemented as a tool.
Predicting disease resistance proteins from plant genomes and transcriptomes with machine learning techniques can, therefore, provide rapid insights into their identification without depending on domain- or alignment-based methods. The present study was implemented as a bioinformatics tool based on an SVM classifier for predicting R genes in plants. The open access tool is available at http://14.139.240.55/NGS/download.php. The current research aims to provide a novel machine learning tool for the prediction of R proteins in plants. The developed method was validated on a test set consisting of R proteins and non-R proteins and produced high accuracy and sensitivity.
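To make the idea of alignment-independent prediction concrete, the following Python sketch turns a protein sequence into a fixed-length numeric vector using amino acid and dipeptide composition. These two descriptors are an assumption chosen for illustration; this excerpt does not say which feature extraction methods DRPPP actually uses, and the function names and example peptide are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 values)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair in the sequence (400 values)."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs[d] for d in DIPEPTIDES)
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [pairs[d] / total for d in DIPEPTIDES]


def feature_vector(seq):
    """Concatenate both descriptors; no alignment or domain search is needed."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


example = "MKTLLLTLVVVTIVCLDLGYT"  # made-up peptide, not a real R protein
vec = feature_vector(example)
print(len(vec))  # 420
```

Because every sequence maps to a vector of the same length regardless of its similarity to known R proteins, such vectors can be passed directly to a classifier such as an SVM.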
Methods
A collection of 112 reference R proteins from 25 plant species, belonging to seven different domain classes (Table 1), was downloaded from the PRGdb database [19] to serve as the positive dataset. For the negative dataset, all protein sequences known to date from the same plant species were obtained from the NCBI protein database. Out of these sequences, 158 protein sequences were selected randomly using an in-house Perl script. These 158 random protein sequences were manually verified
Evaluation of test datasets
The performance of the model was evaluated using test datasets (sequences from the positive and negative datasets that were not used in training the model) consisting of 45 sequences, including 22 positive and 23 negative sequences. Evaluating the model on held-out data is important for checking whether it has overfitted the training data. The data files from the test datasets were normalized according to the SVM model requirements, i.e. scaled to between 0 and 1. In order to compare the efficiency of the
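The 0-to-1 normalization mentioned above can be illustrated with a minimal min-max scaling sketch in Python. It assumes that scaling ranges are learned from the training features and then reused for the test features, and that out-of-range test values are clipped; the excerpt does not state which procedure or tool the authors used, so the function names and toy numbers are hypothetical.

```python
def fit_min_max(rows):
    """Learn the (min, max) of every feature column from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, ranges):
    """Scale each feature to [0, 1] using previously learned ranges."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                out.append(0.0)  # constant column carries no information
            else:
                x = (value - lo) / (hi - lo)
                out.append(min(1.0, max(0.0, x)))  # clip unseen extremes (assumption)
        scaled.append(out)
    return scaled


# Toy feature rows, invented for illustration only.
train = [[2.0, 10.0, 5.0], [4.0, 30.0, 5.0], [6.0, 20.0, 5.0]]
test = [[5.0, 40.0, 5.0]]

ranges = fit_min_max(train)
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)
print(train_scaled)  # [[0.0, 0.0, 0.0], [0.5, 1.0, 0.0], [1.0, 0.5, 0.0]]
print(test_scaled)   # [[0.75, 1.0, 0.0]]
```

Reusing the training ranges keeps the test features on the same scale the model saw during training, which matters for SVM kernels that are sensitive to feature magnitude.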
Results
In total, 112 positive and 119 negative sequences were collected from PRGdb and the protein database maintained by the NCBI (National Center for Biotechnology Information), respectively. The detailed distribution into training and test datasets is given in Table 2. The data were divided into about 80% for training and about 20% for testing the trained SVM model. A total of 10,270 features were generated using 16 different methods. These features were pre-processed to adjust NA
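The split sizes can be checked with a small Python sketch that holds out 22 positive and 23 negative sequences, the test-set composition stated in the evaluation section, from the 112 positive and 119 negative sequences. The seeded random hold-out and the placeholder sequence identifiers are assumptions for illustration; the actual assignment of sequences is given in Table 2, which is not part of this excerpt.

```python
import random


def hold_out(items, n_test, seed=0):
    """Shuffle a copy of items and return (training part, test part)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder identifiers standing in for the real sequences.
positives = [f"R_{i}" for i in range(112)]
negatives = [f"nonR_{i}" for i in range(119)]

# Splitting each class separately preserves the class balance in both parts.
pos_train, pos_test = hold_out(positives, 22)
neg_train, neg_test = hold_out(negatives, 23)

n_train = len(pos_train) + len(neg_train)
n_test = len(pos_test) + len(neg_test)
test_percent = round(100 * n_test / (n_train + n_test), 2)
print(n_train, n_test, test_percent)  # 186 45 19.48
```

With these counts the test set is 19.48% of the 231 sequences, which matches the "about 20%" stated in the text.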
Discussion
Large-scale cultivation of commercial crops has declined in recent years owing to losses from biotic stresses caused by fungal, viral, bacterial and other pathogens and pests. R genes in plant genomes provide resistance against pathogens through the R proteins they encode [10]. The identification of R genes can reduce reliance on chemical inputs, improving environmental sustainability. Development of disease resistance has always been one of the major objectives for any crop improvement
Conclusions
The features that are influential for predicting R genes are not completely known; therefore, a set of all important features was generated through feature extraction techniques. The developed method and implemented tool, DRPPP, can efficiently predict R proteins with the highest accuracy (91.11%) among the existing tools compared. In future, DRPPP will be updated as more R genes are discovered, thereby enhancing its efficiency.
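The reported accuracy can be related to the 45-sequence test set (22 positive, 23 negative) described in the evaluation section. The Python sketch below finds the number of correct predictions that rounds to 91.11% and then lists every split of those correct predictions between the two classes, with the sensitivity and specificity each would imply. The confusion matrix itself is not reported in this excerpt, so the enumerated rows are possibilities rather than results.

```python
N_POS, N_NEG = 22, 23        # test-set composition from the evaluation section
N_TEST = N_POS + N_NEG
REPORTED_ACCURACY = 91.11    # percent, as stated in the conclusions

# Which counts of correct predictions round to the reported accuracy?
correct = [c for c in range(N_TEST + 1)
           if round(100 * c / N_TEST, 2) == REPORTED_ACCURACY]
print(correct)  # [41]

# Enumerate (true positives, true negatives) pairs that sum to that count.
n_correct = correct[0]
candidates = []
for tp in range(N_POS + 1):
    tn = n_correct - tp
    if 0 <= tn <= N_NEG:
        sensitivity = round(100 * tp / N_POS, 2)  # TP / all positives
        specificity = round(100 * tn / N_NEG, 2)  # TN / all negatives
        candidates.append((tp, tn, sensitivity, specificity))

for row in candidates:
    print(row)
```

An accuracy of 91.11% therefore corresponds to 41 of 45 test sequences classified correctly, with between 18 and 22 of the 22 positive sequences recovered.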
Author contributions
TP, VJ and RSC conceptualized and derived this study. TP and VJ tested the models included in this study. RSC helped in drafting and finalizing the manuscript. All authors have contributed to, seen, and approved the manuscript.
Conflict of interest
The authors declare that they have no conflict of interest.
Acknowledgment
The authors are thankful to the Department of Biotechnology, Ministry of Science & Technology (Grant no. BT/01/CEIB/09/V/02), Government of India for providing funds to RSC in the form of a program support on high-value medicinal plants. The authors are thankful to the Jaypee University of Information Technology for providing the necessary research facilities.
References (28)
- et al. Update on the domain architectures of NLRs and R proteins. Biochem. Biophys. Res. Commun. (2006)
- et al. Structure, function and evolution of plant disease resistance genes. Curr. Opin. Plant Biol. (2000)
- et al. Using the concept of pseudo amino acid composition to predict resistance gene against Xanthomonas oryzae pv. oryzae in rice: an approach from chaos games representation. J. Theor. Biol. (2011)
- The genetic architecture of resistance. Curr. Opin. Plant Biol. (2000)
- et al. Constitutive gain-of-function mutants in a nucleotide binding site–leucine rich repeat protein encoded at the Rx locus of potato. Plant J. (2002)
- et al. Application of high-dimensional feature selection: Evaluation for genomic prediction in Man. Sci. Rep. (2015)
- et al. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) (2011)
- et al. The Pseudomonas syringae avrRpt2 gene product promotes pathogen virulence from inside plant cells. Mol. Plant-Microbe Interact. (2000)
- et al. Support-vector networks. Mach. Learn. (1995)
- et al. Plant pathogens and integrated defense responses to infection. Nature (2001)
- Plant NBS-LRR proteins in pathogen sensing and host defense. Nat. Immunol.
- Putative resistance genes in the CitEST database. Genet. Mol. Biol.
- The role of leucine-rich repeat proteins in plant defenses. Adv. Bot. Res.
- The plant immune system. Nature