Target-DBPPred: An intelligent model for prediction of DNA-binding proteins using discrete wavelet transform based compression and light eXtreme gradient boosting
Introduction
DNA-protein interaction plays a significant role in numerous biological processes, including DNA damage, repair, recombination, and translation [1]. According to a past study, about 2–5% of prokaryotic genomes and 6–7% of eukaryotic genomes directly encode DNA-binding proteins (DBPs) [2]. Some DBPs are primarily involved in gene replication and transcription, while others are responsible for shaping chromosomes into their proper structure. More importantly, these DBPs package DNA into a compact structure called chromatin [3]. Studying DNA-protein interactions is important for understanding the treatment of various chronic human diseases and for the design of many drugs in the pharmaceutical industry. For example, nuclear receptors are mainly used as targets in breast and prostate cancer treatment. Glucocorticoid receptors (a sub-class of DBPs) are the targets of dexamethasone, which is used to treat asthma, allergies, inflammatory conditions, and autoimmune diseases [[4], [5], [6]]. Additionally, nuclear receptors play a critical role in understanding the physiology of liver cancer [7]. Inhibitor of DNA binding (ID) proteins have influential functions in tumor-associated developments such as angiogenesis, metastasis, and chemoresistance. Likewise, a GntR-like regulator is utilized in antibiotic production [8]. Some DBPs, such as zinc-finger nucleases (ZFNs), can be utilized in HIV/AIDS treatment [9].
A series of experimental methods have been established to identify the interaction mechanism of DNA-protein, including X-ray crystallography [10], NMR (nuclear magnetic resonance) [11], micro-matrix [12], and genomic analysis [13]. However, these methods have several shortcomings: they are resource-intensive, laborious, and costly, and they yield unsatisfactory predictions. With the development of advanced instruments and methodologies, the number of protein sequences in the world's protein databases is increasing rapidly. Therefore, machine learning-based predictive approaches are highly desired. In this regard, a series of computational systems such as Seq(DNA) [14], iDBPs [15], DNA-Prot [16], DBD-Hunter [17], iDNA-Prot [18], DBPPred [19], iDNA-Prot|dis [20], Kmer1+ACC [21], Local-DPP [22], PseDNA-Pro [23], BindUP [24], PSFM-DBT [25], HMMBinder [26], SVM-PSSM-DT [27], iDNAProt-ES [28], DPP-PseAAC [29], StackDPPred [30], MSFBinder [31], DBPPred-PDSD [1], DP-BINDER [32], and TargetDBP [33] were introduced for the identification of DBPs.
Deep-learning (DL) based predictors have also been presented for DBP prediction in recent years. For instance, Qu et al. used a 1D CNN (one-dimensional convolutional neural network) in their work. In this method, the encoding layer of the CNN maps a primary protein sequence to a fixed-length numerical vector. In another DL-based approach, Du et al. used a sequence segmentation strategy and a deep neural network [34]. Each method improved the prediction of DBPs. However, most existing predictors extract features with simple methods such as AAC, DPC, and PSSM, and their models are trained with traditional machine learning algorithms that are usually unable to predict DBPs accurately.
Considering the above-noted limitations, this study introduces a novel predictor named Target-DBPPred. The major contributions of our study are described below.
- 1) Designed a novel feature encoder (EDF-PSSM-DWT). Instead of computing evolutionary features from a simple PSSM, EDF-PSSM is used to capture sequence-order information and explore global features. Further, a compression technique (DWT) is incorporated into EDF-PSSM to remove less informative patterns. Thus, EDF-PSSM-DWT makes the discriminative features more visible to learning algorithms.
- 2) Used LiXGB to detect the distinctive features during model training and improve prediction performance.
- 3) Constructed a novel predictor (Target-DBPPred) that predicts DBPs more accurately than existing methods and could be fruitful for large-sized datasets.
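The DWT compression step named in the first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the Haar wavelet, the two decomposition levels, the mean/min/max summary, and the function names `haar_dwt`, `compress_column`, and `dwt_features` are all assumptions made for illustration, and the EDF-PSSM step itself is not reproduced. The sketch only shows how repeated low-pass filtering of each column of a score matrix discards high-frequency detail and yields a fixed-length vector.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    signal = list(signal)
    if len(signal) % 2:
        # Pad odd-length signals by repeating the last value (an assumption).
        signal.append(signal[-1])
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def compress_column(column, levels=2):
    """Apply up to `levels` rounds of Haar DWT, keeping only the low-frequency part."""
    approx = list(column)
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, _detail = haar_dwt(approx)  # detail coefficients are discarded
    return approx


def dwt_features(matrix, levels=2):
    """Summarise each column of an L x C score matrix by the mean, min, and max
    of its approximation coefficients, giving 3 * C features for any length L."""
    features = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        approx = compress_column(column, levels)
        features.extend([sum(approx) / len(approx), min(approx), max(approx)])
    return features
```

For a real PSSM with 20 columns, `dwt_features` would return 60 values regardless of protein length; which statistics or coefficients the paper actually retains is not stated in this excerpt.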
The architecture of the methodology used in this study is shown in Fig. 1, and the details are described in the following sections.
Dataset
In this work, two datasets are taken from a past study [35]. One dataset (PDB14189) is adopted for training the model, and the second dataset (PDB2272) is used as a testing set to assess the model's prediction power. The training set is collected from the UniProt database [36]. The CD-HIT toolkit is used to remove sequences with more than 25% similarity. Sequences longer than 6000 amino acids or shorter than 50 amino acids are eliminated from both datasets. The final dataset contains 7129 DBPs and
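The length criterion described in the dataset paragraph can be sketched as a simple filter. This is an illustration only: the function name `filter_by_length` and the dictionary-of-sequences input are hypothetical, and the CD-HIT redundancy removal is a separate external step that is not shown here.

```python
def filter_by_length(records, min_len=50, max_len=6000):
    """Keep sequences whose length lies in [min_len, max_len].

    `records` maps a sequence identifier to its amino-acid string. Sequences
    shorter than 50 or longer than 6000 residues are dropped, following the
    thresholds stated in the text.
    """
    return {
        name: seq
        for name, seq in records.items()
        if min_len <= len(seq) <= max_len
    }


# Hypothetical toy input; real identifiers would come from the source database.
example = {
    "too_short": "A" * 49,
    "shortest_kept": "A" * 50,
    "longest_kept": "A" * 6000,
    "too_long": "A" * 6001,
}
kept = filter_by_length(example)
```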
Results and discussion
This section describes the experimental outcomes of feature encoders with machine-learning algorithms. The following sections also elucidate the comparative performance analysis of classifiers and existing approaches.
Conclusion and future work
DNA-protein interactions play a crucial role in biological activities, including DNA transcription, repair, and replication. Most DBPs are used in drugs designed for many fatal human diseases and in other pharmaceutical products. This work shows that LiXGB achieves better performance than the available predictors for DBP prediction. The better results of our novel protocol are due to several factors like the appropriate feature extraction method,
Declaration of competing interest
The authors declare no conflicts of interest.
Acknowledgments
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work under grant number RGP.1/85/42.
References (69)
- et al. Variation in form and function: the helix-turn-helix regulators of the GntR superfamily. Adv. Appl. Microbiol. (2009)
- et al. Local-DPP: an improved DNA-binding protein prediction method by exploring local evolutionary information. Inf. Sci. (2017)
- et al. DPP-PseAAC: a DNA-binding protein prediction model using Chou's general PseAAC. J. Theor. Biol. (2018)
- et al. Improving prediction of extracellular matrix proteins using evolutionary information via a grey system model and asymmetric under-sampling technique. Chemometr. Intell. Lab. Syst. (2018)
- et al. AFP-CMBPred: Computational identification of antifreeze proteins by extending consensus sequences into multi-blocks evolutionary information. Comput. Biol. Med. (2021)
- et al. iHBP-DeepPSSM: Identifying hormone binding proteins using PsePSSM based evolutionary features and deep learning approach. Chemometr. Intell. Lab. Syst. (2020)
- et al. Prediction of antitubercular peptides via heterogeneous feature representation and genetic algorithm based ensemble learning model. Comput. Biol. Med. (2021)
- et al. iAFPs-EnC-GA: identifying antifungal peptides using sequential and evolutionary descriptors based multi-information fusion and ensemble learning approach. Chemometr. Intell. Lab. Syst. (2022)
- et al. Deep-AntiFP: Prediction of antifungal peptides using distanct multi-informative features incorporating with deep neural networks. Chemometr. Intell. Lab. Syst. (2021)
- et al. Application of wavelet entropy in analysis of electrochemical noise for corrosion type identification. Electrochem. Commun. (2014)