iPro-GAN: A novel model based on generative adversarial learning for identifying promoters and their strength
Introduction
A promoter is a DNA sequence that RNA polymerase recognizes and binds in order to initiate transcription; it acts like a “switch” that determines gene activity [1]. However, the promoter does not control gene activity on its own; it does so by binding proteins called transcription factors. A transcription factor acts as a “flag” that directs the activity of the enzyme (RNA polymerase) [2], which produces RNA copies of genes. Promoters are generally divided into broad-spectrum expression promoters, tissue-specific promoters, tumor-specific promoters and other forms. Mutations in the promoter of a gene can alter the regulation of gene expression, which is common in malignant tumors. In bacteria, a σ factor directs the RNA polymerase holoenzyme, which is the combination of the RNA polymerase core enzyme and a regulatory protein, to the transcription initiation site. In E. coli, there are several σ factors, such as σ24, σ28, σ32, σ38, σ54 and σ70, each of which has a specific function [2], [3], [4]. Consequently, identifying promoters and their strength is of great significance.
Traditional methods such as ChIP-seq and RNA-seq make promoter identification difficult, time-consuming and expensive. Moreover, the explosive growth of biological sequence data makes it necessary to develop computational models. Several statistical models for promoter prediction have already been studied. For example, Reese [5] used a time-delay neural network for promoter prediction (NNPP), and the model gave a recognition rate of 75%. In 2005, Florquin et al. [6] used the large-scale structural characteristics (LSSC) of nucleotides to analyze core promoters, and the overall classification power was close to 80%. In the same year, Burden et al. [7] created a new technique called TSL-NNPP by combining the consistent empirical probability distribution information of E. coli promoters with the results from NNPP2.2, and the probability of misprediction decreased to 52%. In 2006, Li et al. [8] proposed a position-correlation scoring matrix (PCSM) algorithm to identify σ70 promoters by sequence conservation analysis, and the overall prediction sensitivity and specificity were 91% and 81%, respectively. In 2009, Rangannan et al. proposed using the relative stability of DNA as a general criterion for promoter prediction and named the resulting tool PromPredict [9]. It reached precision values of 58% in E. coli and 60% in B. subtilis. In 2012, Song [10] developed a model based on a new variable-window Z-curve and partial least squares technique (VWZ-PLST) to identify core promoters. The accuracy was more than 90% on each sub-dataset, which implied that VWZ-PLST exceeded all previous methods. In 2014, Silva et al. [11] applied DNA double strand stability (DBSS) as a distinguishing feature to identify σ54 and σ28 promoters, and obtained accuracies of 78.8% and 80%, respectively. In the same year, Lin et al. [12] developed a predictor named iPro54-PseKNC, which achieved an accuracy of 93.79% for identifying σ54 promoters.
In 2018, Liu et al. [2] set up a two-layer predictor called iPromoter-2L, the first model for identifying promoters and their strength. The accuracy of the first layer was 81.68%, and the accuracy of the second layer was between 80% and 95%. In 2019, Xiao et al. [3] built a two-layer model, iPSW(2L)-PseKNC, to identify promoters and their strength. On the benchmark dataset, the accuracy of Layer-1 was 83.13% and the accuracy of Layer-2 was 71.20%. In the same year, Le et al. [13] classified promoters by deep learning and a combination of continuous FastText N-Grams (DL-CFTNG), and this method reached accuracies of 85.41% and 73.1% in the two layers. Recently, Liang et al. developed a new model called iPromoter-ET [14], which studies promoters by combining multiple feature extraction methods with extremely randomized trees for selecting optimal features. The accuracy of the first layer of iPromoter-ET reached 85.14%, and the accuracy of the second layer was 72.59%. In addition, Table 1 summarizes the data used in the above models and links to the data sources.
Researchers have contributed a great deal to promoter identification, and prediction accuracy has improved, but some problems remain. Few models can identify both promoters and their strength; most can only identify promoters but not their types. Besides, with the exception of [3], most models are not tested on independent datasets. Therefore, it is of great significance to develop a convincing model that predicts promoters and their strength more accurately.
Deep learning, proposed by Hinton et al. [15], is a field of machine learning research inspired by artificial neural networks. A deep neural network combines low-level features into more abstract high-level features in order to discover distributed feature representations of the data, loosely simulating how the human brain analyzes and learns. Deep learning algorithms can be divided into three categories: convolutional neural networks, recurrent neural networks and generative adversarial networks. They have been successfully applied in computer vision, speech recognition, natural language processing and other fields [16], [17], [18], [19], [20], [21]. In recent years, the use of deep learning for biological sequence recognition has achieved satisfactory results [22], [23], [24], [25], [26].
Based on previous studies of promoters and generative adversarial networks, a novel two-layer model named iPro-GAN is established to identify promoters and their strength. It uses Moran-based spatial auto-cross correlation for feature extraction and a generative adversarial network based on deep convolution for classification. Most importantly, the model achieves the best results reported so far. Fig. 1 shows how iPro-GAN works.
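The two-layer decision flow can be sketched as follows. This is only an illustration of the cascade, not the iPro-GAN implementation: the function `two_layer_predict` and the two toy classifiers are hypothetical stand-ins (a simple A/T-fraction threshold and a motif check) used in place of the trained networks.

```python
from typing import Callable

# A layer is any function that maps a DNA sequence to a yes/no decision.
Layer = Callable[[str], bool]


def two_layer_predict(seq: str, is_promoter: Layer, is_strong: Layer) -> str:
    """Layer 1 separates promoters from non-promoters;
    layer 2 is only consulted for sequences accepted by layer 1."""
    if not is_promoter(seq):
        return "non-promoter"
    return "strong promoter" if is_strong(seq) else "weak promoter"


# Hypothetical toy classifiers (assumptions for illustration only).
def toy_layer1(seq: str) -> bool:
    # Accept sequences whose A/T fraction exceeds one half.
    return sum(base in "AT" for base in seq) / len(seq) > 0.5


def toy_layer2(seq: str) -> bool:
    # Call a promoter "strong" if it contains the hexamer TATAAT.
    return "TATAAT" in seq


if __name__ == "__main__":
    for s in ("GCGCGCGC", "ATATTATAATAT", "ATATATGC"):
        print(s, "->", two_layer_predict(s, toy_layer1, toy_layer2))
```

Keeping the two layers as separate callables allows each layer to use its own features and parameters.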
Dataset
The first step in developing a reliable prediction model is to construct a high-quality benchmark dataset. All the core promoter sequences are obtained from Xiao et al.’s study [3], which can be formulated as

$$S = S^{+} \cup S^{-}, \qquad S^{+} = S^{+}_{\mathrm{strong}} \cup S^{+}_{\mathrm{weak}}$$

where $S^{+}$ contains 3382 promoters, $S^{-}$ contains 3382 non-promoters, $S^{+}_{\mathrm{strong}}$ contains 1591 strong promoters, $S^{+}_{\mathrm{weak}}$ contains 1792 weak promoters, and $\cup$ denotes the set union. Those sample sequences with high similarity have been removed by
Optimal features
In feature extraction, Moran-based spatial auto-cross correlation takes into account the physicochemical properties of dinucleotides and integrates global sequence order information into our model. The dimension of the MSA feature increases rapidly with the parameter λ, whose value ranges from 1 to 10. As shown in Fig. 5, λ = 9 gives the highest accuracy for identifying promoters and non-promoters, while λ = 7 gives the best accuracy for identifying strong promoters and weak
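To make the role of λ concrete, the sketch below computes Moran-style auto- and cross-correlation features over dinucleotide property tracks. It is a minimal illustration under stated assumptions, not the paper's exact formulation: the two "properties" (G/C count and purine count per dinucleotide) are toy stand-ins for real physicochemical indices, the names `property_track`, `moran` and `msa_features` are hypothetical, and normalizing the cross term by the product of standard deviations is an assumption.

```python
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]

# Toy stand-in properties (assumption: NOT the physicochemical indices used in the paper).
TOY_PROPERTIES = {
    "gc_count": {d: float(sum(b in "GC" for b in d)) for d in DINUCLEOTIDES},
    "purine_count": {d: float(sum(b in "AG" for b in d)) for d in DINUCLEOTIDES},
}


def property_track(seq, table):
    """Map each overlapping dinucleotide of an uppercase A/C/G/T sequence to its property value."""
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def moran(x, y, lag):
    """Moran-style correlation between track x at position i and track y at position i + lag.

    With x == y this reduces to the usual Moran autocorrelation
    (lagged covariance divided by the variance of the track).
    """
    n = len(x)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(track)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((x[i] - mean_x) * (y[i + lag] - mean_y) for i in range(n - lag)) / (n - lag)
    std_x = (sum((v - mean_x) ** 2 for v in x) / n) ** 0.5
    std_y = (sum((v - mean_y) ** 2 for v in y) / n) ** 0.5
    denom = std_x * std_y
    return cov / denom if denom > 0 else 0.0


def msa_features(seq, max_lag, properties=TOY_PROPERTIES):
    """Concatenate auto- and cross-correlations for every lag 1..max_lag."""
    tracks = {name: property_track(seq, table) for name, table in properties.items()}
    names = sorted(tracks)
    return [
        moran(tracks[a], tracks[b], lag)
        for lag in range(1, max_lag + 1)
        for a in names
        for b in names
    ]


if __name__ == "__main__":
    example = "AAGG" * 5
    print(len(msa_features(example, max_lag=9)))
```

In this sketch the feature vector has N² · λ entries for N properties, which shows why the dimension grows quickly as λ increases.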
Discussion
To find the differences in sequence patterns between promoters and non-promoters, as well as between strong and weak promoters, Two Sample Logos [51,65] with an independent t-test (p-value < 0.05) is applied to visualize the sequences. Fig. 10(I) shows that, compared with non-promoters, most positions along the promoter sequences are enriched only with T and A and depleted only of G and C. For strong promoters and weak promoters, as shown in Fig. 10(II), it can
Conclusion
In this study, a new model named iPro-GAN is established to identify promoters and their strength. Moran-based spatial auto-cross correlation, which considers the physicochemical properties of dinucleotides, is used to extract sequence features. Then, a generative adversarial network based on deep convolution is built to predict promoters and their strength. Afterwards, metrics including accuracy, sensitivity, specificity, Matthews correlation coefficient and the area under Receiver Operating
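For reference, the threshold-based metrics named above can be computed from a binary confusion matrix as in the following sketch. The function name `binary_metrics` is hypothetical, and the counts in the usage example are made up for illustration; they are not results from this study. The area under the ROC curve needs ranked prediction scores rather than counts, so it is not covered here.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient
    from confusion-matrix counts (positive class = promoter, or strong promoter)."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(binary_metrics(tp=90, tn=80, fp=20, fn=10))
```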
Ethical approval
No ethics approval is required.
Declaration of Competing Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No. 12101480), the Natural Science Basic Research Program of Shaanxi (No. 2021JM-115), and the Fundamental Research Funds for the Central Universities (No. JB210715).
References (65)
Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Comput. Chem. (2001)
The recognition and prediction of sigma70 promoters in Escherichia coli K-12. J. Theor. Biol. (2006)
iPromoter-ET: identifying promoters and their strength by extremely randomized trees-based feature selection. Anal. Biochem. (2021)
Prediction of human protein subcellular localization using deep learning. J. Parallel Distr. Com. (2018)
Evaluation of deep learning detection and classification towards computer-aided diagnosis of breast lesions in digital X-ray mammograms. Comput. Meth. Prog. Bio. (2020)
LightGBM-PPI: predicting protein-protein interactions through LightGBM with multi-information fusion. Chemometr. Intell. Lab. (2019)
Identifying DNase I hypersensitive sites using multi-features fusion and F-score features selection via Chou's 5-steps rule. Biophys. Chem. (2019)
A GAN-based image synthesis method for skin lesion classification. Comput. Meth. Prog. Bio. (2020)
Mass Image Synthesis in Mammogram with Contextual Information Based on GANs. Comput. Meth. Prog. Bio. (2021)
Evaluating the performance of face sketch generation using generative adversarial networks. Pattern Recogn. Lett. (2019)
iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem.
KD-KLNMF: identification of lncRNAs subcellular localization with multiple features and nonnegative matrix factorization. Anal. Biochem.
iR5hmcSC: identifying RNA 5-hydroxymethylcytosine with multiple features based on stacking learning. Comput. Biol. Chem.
pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J. Theor. Biol.
DNNAce: prediction of prokaryote lysine acetylation sites through deep neural networks with multi-information fusion. Chemometr. Intell. Lab.
bTSSfinder: a novel tool for the prediction of promoters in Cyanobacteria and Escherichia coli. Bioinformatics
iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics
iPSW(2L)-PseKNC: a two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics
RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res.
Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res.
Improving promoter prediction improving promoter prediction for the NNPP2.2 algorithm: a case study using Escherichia coli DNA sequences. Bioinformatics
Relative stability of DNA as a generic criterion for promoter prediction: whole genome annotation of microbial genomes with varying nucleotide base composition. Mol. BioSyst.
Recognition of prokaryotic promoters based on a novel variable-window Z-curve method. Nucleic Acids Res.
DNA duplex stability as discriminative characteristic for Escherichia coli (54)- and (28)- dependent promoter sequences. Biol. J. Int. Assoc. Biol. Stand.
iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res.
Classifying promoters by interpreting the hidden information of DNA sequences via deep learning and combination of continuous FastText N-Grams. Front. Bioeng. Biotech.
A Fast Learning Algorithm for Deep Belief Nets. Neural Comput.
DG-Font: deformable Generative Networks for Unsupervised Font Generation
On Semantic Similarity in Video Retrieval
A self-adaptive mutation neural architecture search algorithm based on blocks. IEEE Comput. Intell. M.
A Multiobjective evolutionary approach based on graph-in-graph for neural architecture search of convolutional neural networks. Int. J. Neural Syst.
Evolutionary neural architecture search for high-dimensional skip-connection structures on DenseNet style networks. IEEE T. Evolut. Comput.