Target-DBPPred: An intelligent model for prediction of DNA-binding proteins using discrete wavelet transform based compression and light eXtreme gradient boosting

https://doi.org/10.1016/j.compbiomed.2022.105533

Highlights

  • Designed a novel predictor named Target-DBPPred for prediction of DNA-binding proteins.

  • The features are explored by EDF-PSSM-DWT, F-PSSM, PSSM-DPC, and Lead-BiPSSM.

  • The classification is performed by LiXGB, XGB, and ERT.

  • Target-DBPPred achieves the highest prediction performance among the compared predictors.

Abstract

DNA-protein interaction is a critical biological process that underlies essential activities, including DNA transcription and recombination. DNA-binding proteins (DBPs) are closely associated with several human diseases (asthma, cancer, and AIDS), and some DBPs are used in the production of antibiotics, steroids, and anti-inflammatory drugs. Several methods have been reported for the prediction of DBPs, but a more accurate method is still highly desirable. This study presents a computational method, Target-DBPPred, to improve DBP prediction. Features are extracted from primary protein sequences with a novel descriptor, EDF-PSSM-DWT (evolutionary difference formula position-specific scoring matrix-discrete wavelet transform), and several other evolutionary descriptors: F-PSSM (filtered position-specific scoring matrix), EDF-PSSM (evolutionary difference formula position-specific scoring matrix), PSSM-DPC (position-specific scoring matrix-dipeptide composition), and Lead-BiPSSM (lead-bigram position-specific scoring matrix). The best feature set of each descriptor is selected using sequential forward selection (SFS). Four models are then trained using AdaBoost, XGB (eXtreme gradient boosting), ERT (extremely randomized trees), and LiXGB (light eXtreme gradient boosting) classifiers. LiXGB with the best feature set of EDF-PSSM-DWT attains 6.69% and 15.07% higher accuracy on the training and testing datasets, respectively. These results verify the improved performance of the proposed predictor over existing predictors.

Introduction

DNA-protein interaction plays a significant role in numerous biological processes, including DNA damage, repair, translation, and recombination [1]. According to a past study, about 2–5% of the prokaryotic genome and 6–7% of the eukaryotic genome directly encode DBPs [2]. Some DBPs are primarily involved in gene replication and transcription, while others are responsible for shaping chromosomes into their proper structure. More importantly, these DBPs pack DNA into a compact form called chromatin [3]. The study of DNA-protein interaction is important for understanding the treatment of various chronic human diseases and for the development of many drugs in the pharmaceutical industry. For example, nuclear receptors are mainly targeted in breast and prostate cancer treatment. Glucocorticoid receptors (a sub-class of DBPs) are the target of dexamethasone, which is used to treat asthma, allergies, inflammatory conditions, and autoimmune diseases [4], [5], [6]. Additionally, nuclear receptors play a critical role in understanding the physiology of liver cancer [7]. Inhibitor of DNA binding (ID) proteins have influential functions in tumor-associated processes such as angiogenesis, metastasis, and chemoresistance. In addition, a GntR-like regulator is utilized in antibiotic production [8]. Some DBPs, such as ZFNs (zinc-finger nucleases), can be utilized in AIDS/HIV treatment [9].

A series of experimental methods have been established to identify the interaction mechanism of DNA-protein complexes, including X-ray crystallography [10], NMR (nuclear magnetic resonance) [11], micro-matrix [12], and genomic analysis [13]. However, these methods are resource-intensive, laborious, and costly, and their predictions can be unsatisfactory. With the development of advanced instruments and methodologies, the number of protein sequences in public protein databases is increasing rapidly. Therefore, machine learning-based predictive approaches are highly desirable. In this regard, a series of computational systems, such as Seq(DNA) [14], iDBPs [15], DNA-Prot [16], DBD-Hunter [17], iDNA-Prot [18], DBPPred [19], iDNA-Prot|dis [20], Kmer1+ACC [21], Local-DPP [22], PseDNA-Pro [23], BindUP [24], PSFM-DBT [25], HMMBinder [26], SVM-PSSM-DT [27], iDNAProt-ES [28], DPP-PseAAC [29], StackDPPred [30], MSFBinder [31], DBPPred-PDSD [1], DP-BINDER [32], and TargetDBP [33], have been introduced for the identification of DBPs.

DL (deep-learning) based predictors have also been presented for DBP prediction in recent years. For instance, Qu et al. used a 1D CNN (one-dimensional convolutional neural network), in which the encoding layer maps a primary protein sequence to a fixed-length numeric vector. In another DL-based approach, Du et al. used a sequence segmentation strategy and a deep neural network [34]. Each method improved the prediction of DBPs. However, most existing predictors extract features with simple methods such as AAC, DPC, and PSSM, and train traditional machine learning models that are usually unable to predict DBPs accurately.

Considering the above-noted limitations, this study introduces a novel predictor named Target-DBPPred. The major contributions of our study are described below.

  • 1) A novel feature encoder (EDF-PSSM-DWT) is designed. Instead of computing evolutionary features from the simple PSSM, EDF-PSSM is used to capture sequence-order information and explore global features. Further, a compression technique (DWT) is incorporated into EDF-PSSM to remove less informative patterns. Thus, EDF-PSSM-DWT makes the discriminative features more visible to learning algorithms.

  • 2) LiXGB is used to detect the distinctive features during model training to improve the prediction performance.

  • 3) A novel predictor (Target-DBPPred) is constructed that predicts DBPs more accurately than existing methods and could be fruitful for large-sized datasets.
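The compression idea in contribution 1 can be illustrated with a minimal sketch. It assumes a Haar wavelet, two decomposition levels, and four summary statistics per column; the text above does not state these settings, and the names `haar_dwt` and `dwt_column_features` are hypothetical. The input stands in for any L × C score matrix with one row per residue, such as a PSSM-derived matrix.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT: return (approximation, detail) coefficients."""
    values = list(signal)
    if len(values) % 2:  # pad odd-length signals by repeating the last sample
        values.append(values[-1])
    root2 = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / root2 for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / root2 for i in range(0, len(values), 2)]
    return approx, detail


def dwt_column_features(matrix, levels=2):
    """Compress each column of an L x C matrix into (min, max, mean, std).

    Only the low-frequency (approximation) band is kept at every level, so the
    high-frequency detail is discarded. The output length is 4 * C for any L.
    """
    features = []
    for col in range(len(matrix[0])):
        band = [row[col] for row in matrix]
        for _ in range(levels):
            if len(band) < 2:  # nothing left to decompose
                break
            band, _detail = haar_dwt(band)
        mean = sum(band) / len(band)
        std = math.sqrt(sum((x - mean) ** 2 for x in band) / len(band))
        features.extend([min(band), max(band), mean, std])
    return features


# Toy 4 x 2 matrix standing in for a per-residue score matrix (assumed data).
toy = [[2.0, 1.0], [2.0, -1.0], [2.0, 1.0], [2.0, -1.0]]
print(dwt_column_features(toy))
```

In the toy input, the second column alternates rapidly between 1 and -1; its approximation band collapses to zero, which shows how high-frequency variation is dropped while the slowly varying first column is retained. Because the output length depends only on the number of columns, sequences of different lengths map to vectors of the same size.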

The architecture of the methodology used in this study is shown in Fig. 1, and the details are elucidated in the following sections.
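One stage of the methodology, sequential forward selection (SFS, named in the abstract), can be sketched as a greedy loop. This is a sketch under stated assumptions rather than the authors' implementation: the function name, the stopping rule (stop when no added feature improves the score), and the toy scoring function are all hypothetical. In practice the scorer would likely be something like the cross-validated accuracy of the chosen classifier on the candidate feature subset.

```python
def sequential_forward_selection(n_features, score_fn, max_features=None):
    """Greedy SFS: repeatedly add the single feature that most improves score_fn.

    score_fn takes a list of feature indices and returns a number (higher is
    better). Selection stops when no addition improves the best score so far.
    """
    limit = n_features if max_features is None else max_features
    selected = []
    remaining = list(range(n_features))
    best_score = float("-inf")
    while remaining and len(selected) < limit:
        # Score every subset formed by adding one more candidate feature.
        candidates = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Assumed scorer: features 0 and 2 help, every other feature costs 0.1."""
    informative = {0, 2}
    chosen = set(subset)
    return len(chosen & informative) - 0.1 * len(chosen - informative)


selected, score = sequential_forward_selection(4, toy_score)
print(sorted(selected), score)
```

With the toy scorer, the loop picks the two informative features and then stops, because adding either remaining feature lowers the score.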

Section snippets

Dataset

In this work, two datasets from a past study [35] are used. One dataset (PDB14189) is adopted for training the model, and the second dataset (PDB2272) is used as a testing set to assess the model's predictive power. The training set is collected from the UniProt database [36]. The CD-HIT toolkit is used to remove sequences that share more than 25% similarity. Sequences longer than 6000 amino acids or shorter than 50 amino acids are eliminated from both datasets. The final dataset contains 7129 DBPs and
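The length filter described in this section can be sketched as follows. The helper name `filter_by_length` and the (identifier, sequence) record layout are assumptions for illustration; the redundancy-removal step with CD-HIT is not shown.

```python
def filter_by_length(records, min_len=50, max_len=6000):
    """Keep (identifier, sequence) pairs whose length lies in [min_len, max_len].

    Sequences shorter than min_len or longer than max_len are dropped, matching
    the thresholds stated in the text (50 and 6000 amino acids).
    """
    return [(name, seq) for name, seq in records if min_len <= len(seq) <= max_len]


# Hypothetical records; the identifiers and sequences are made up.
records = [
    ("too_short", "M" * 49),
    ("lower_edge", "M" * 50),
    ("upper_edge", "M" * 6000),
    ("too_long", "M" * 6001),
]
kept = filter_by_length(records)
print([name for name, _ in kept])
```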

Results and discussion

This section describes the experimental outcomes of feature encoders with machine-learning algorithms. The following sections also elucidate the comparative performance analysis of classifiers and existing approaches.

Conclusion and future work

DBP interactions play a crucial role in biological activities, including DNA transcription, repair, and replication. Many DBPs are used in drugs designed for fatal human diseases and in other pharmaceutical products. This work confirms that LiXGB achieves superior performance over the available predictors for DBP prediction. The better results of our novel protocol are due to several factors like the appropriate feature extraction method,

Declaration of competing interest

The authors declare no conflict of interest.

Acknowledgments

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work under grant number RGP.1/85/42.

References (69)

  • X. Wang et al.

    Determination of corrosion type by wavelet-based fractal dimension from electrochemical noise

    Int. J. Electrochem. Sci.

    (2013)
  • O. Barukab et al.

    DBP-CNN: Deep learning-based prediction of DNA-binding proteins by coupling discrete cosine transform with two-dimensional convolutional neural network

    Expert Syst. Appl.

    (2022)
  • B. Yu et al.

    Prediction subcellular localization of Gram-negative bacterial proteins by support vector machine using wavelet denoising and Chou's pseudo amino acid composition

    Chemometr. Intell. Lab. Syst.

    (2017)
  • A. Ghulam et al.

    Accurate prediction of immunoglobulin proteins using machine learning model

    Inform. Med. Unlocked

    (2022)
  • F. Ali et al.

    Machine learning approaches for discrimination of Extracellular Matrix proteins using hybrid feature space

    J. Theor. Biol.

    (2016)
  • A. Sharma et al.

    A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition

    J. Theor. Biol.

    (2013)
  • J. Zahiri et al.

    PPIevo: protein–protein interaction prediction from PSSM based evolutionary information

    Genomics

    (2013)
  • F. Ali et al.

    DBPPred-PDSD: Machine learning approach for prediction of DNA-binding proteins using Discrete Wavelet Transform and optimized integrated features space

    Chemometr. Intell. Lab. Syst.

    (2018)
  • B. Remeseiro et al.

    A review of feature selection methods in medical applications

    Comput. Biol. Med.

    (2019)
  • M. Pahar et al.

    COVID-19 cough classification using machine learning and global smartphone recordings

    Comput. Biol. Med.

    (2021)
  • Z.U. Khan et al.

    iPredCNC: computational prediction model for cancerlectins and non-cancerlectins using novel cascade features subset selection

    Chemometr. Intell. Lab. Syst.

    (2019)
  • F. Ali et al.

    Classification of membrane protein types using voting feature interval in combination with Chou's pseudo amino acid composition

    J. Theor. Biol.

    (2015)
  • Z.N.K. Swati et al.

    Brain tumor classification for MR images using transfer learning and fine-tuning

    Comput. Med. Imag. Graph.

    (2019)
  • Z.U. Khan et al.

    iRSpot-SPI: Deep learning-based recombination spots prediction by incorporating secondary sequence information coupled with physio-chemical properties via Chou's 5-step rule and pseudo components

    Chemometr. Intell. Lab. Syst.

    (2019)
  • S. Ahmed et al.

    An integrated feature selection algorithm for cancer classification using gene expression data

    Comb. Chem. High Throughput Screen.

    (2018)
  • N.M. Luscombe et al.

    An overview of the structures of protein-DNA complexes

    Genome Biol.

    (2000)
  • K. Sandman et al.

    Diversity of prokaryotic chromosomal proteins and the origin of the nucleosome

    Cell. Mol. Life Sci. CMLS

    (1998)
  • B. Al-Lazikani et al.

    How many drug targets are there

    Nat. Rev. Drug Discov.

    (2006)
  • H. Gronemeyer et al.

    Principles for modulation of the nuclear receptor superfamily

    Nat. Rev. Drug Discov.

    (2004)
  • W.H. Hudson et al.

    Cryptic glucocorticoid receptor-binding sites pervade genomic NF-κB response elements

    Nat. Commun.

    (2018)
  • M. Tran et al.

    Nuclear receptors and liver disease: summary of the 2017 basic research symposium

    Hepatol. Commun.

    (2018)
  • P. Tebas et al.

    Gene editing of CCR5 in autologous CD4 T cells of persons infected with HIV

    N. Engl. J. Med.

    (2014)
  • R. Jaiswal et al.

    Crystallization and preliminary X-ray characterization of the eukaryotic replication terminator Reb1–Ter DNA complex

    Acta Crystallogr. F: Struct. Biol. Commun.

    (2015)
  • J.G. Omichinski et al.

    NMR structure of a specific DNA complex of Zn-containing DNA binding domain of GATA-1

    Science

    (1993)