iPro-GAN: A novel model based on generative adversarial learning for identifying promoters and their strength

https://doi.org/10.1016/j.cmpb.2022.106625

Highlights

  • A novel predictor called iPro-GAN to identify promoters and their strength.

  • Moran-based spatial auto-cross correlation, which considers the physicochemical properties of dinucleotides, is used to extract sequence features.

  • A generative adversarial network based on deep convolution is established to solve the binary classification problem in bioinformatics for the first time.

Abstract

Background and Objective

A promoter is a component of a gene that binds specifically to RNA polymerase, determines where transcription starts, and influences the transcription efficiency of the gene. Promoters can be divided into strong and weak promoters because their structures and interaction time intervals differ considerably. Functional variation in a promoter can lead to a variety of diseases. Identifying promoters and their strength is therefore necessary and biologically important. A novel and promising deep learning model is proposed for this task.

Methods

In this work, we build a powerful model named iPro-GAN for identifying promoters and their strength. First, we collect benchmark and independent datasets for training and testing. Then, the Moran-based spatial auto-cross correlation method is used for feature extraction. Finally, a deep convolutional generative adversarial network with 10-fold cross-validation is applied for classification. The first layer of the model identifies promoters, and the second layer determines their type.

Results

On the benchmark data set, the accuracy of the first layer predictor is 93.15%, and the accuracy of the second layer predictor is 92.30%. On the independent data set, the accuracy of the first layer predictor is 86.77%, and the accuracy of the second layer predictor is 91.66%. In particular, breakthrough progress has been made in the identification of promoters’ strength.

Conclusions

These results are far higher than those of the best existing predictor, which indicates that our model is practical for identifying promoters and their strength. The datasets and source code are available at https://github.com/Bovbene/iPro-GAN.

Introduction

A promoter is a DNA sequence that is recognized and bound by RNA polymerase to start transcription; it acts like a “switch” that determines gene activity [1]. However, the promoter does not control gene activity by itself but does so by binding to proteins called transcription factors. A transcription factor acts as a “flag” that directs the activity of the enzyme RNA polymerase [2], which produces RNA copies of genes. Promoters are generally divided into broad-spectrum expression promoters, tissue-specific promoters, tumor-specific promoters and other forms. Mutations in the promoter of a gene can alter the regulation of gene expression, which is common in malignant tumors. In bacteria, σ factors guide the RNA polymerase holoenzyme, which is the combination of the RNA polymerase core enzyme and a regulatory protein, to the transcription initiation site. In E. coli, there are several σ factors, such as σ24, σ28, σ32, σ38, σ54 and σ70, each of which has a specific function [2], [3], [4]. Consequently, identifying promoters and their strength is of great significance.

Traditional methods such as ChIP-seq and RNA-seq are difficult, time-consuming and expensive ways to identify promoters. Moreover, the explosive growth of biological sequence data makes it necessary to develop computational models. Several statistical models for promoter prediction already exist. For example, Reese [5] used a time-delay neural network for promoter prediction (NNPP), and the model gave a recognition rate of 75%. In 2005, Florquin et al. [6] used the large-scale structural characteristics (LSSC) of nucleotides to analyze core promoters, and the overall classification power was close to 80%. In the same year, Burden et al. [7] created a new technique called TSL-NNPP by combining the consistent empirical probability distribution information of E. coli promoters with the results from NNPP2.2, and the probability of misprediction decreased to 52%. In 2006, Li et al. [8] proposed a position-correlation scoring matrix (PCSM) algorithm to identify σ70 promoters by sequence conservation analysis, and the overall prediction sensitivity and specificity were 91% and 81%, respectively. In 2009, Rangannan et al. proposed using the relative stability of DNA as a general criterion for promoter prediction and named the resulting tool PromPredict [9]. The results showed precision values of 58% in E. coli and 60% in B. subtilis. In 2012, Song [10] developed a model based on a new variable-window Z-curve and partial least squares technique (VWZ-PLST) to identify core promoters. The accuracy was more than 90% on each sub-dataset, which implied that VWZ-PLST exceeded all previous methods. In 2014, Silva et al. [11] applied DNA double strand stability (DBSS) as a distinguishing feature to identify σ54 and σ28 promoters, and obtained accuracies of 78.8% and 80%, respectively. In the same year, Lin et al. [12] developed a predictor named iPro54-PseKNC, which achieved an accuracy of 93.79% for identifying σ54 promoters.
In 2018, Liu et al. [2] set up a two-layer predictor called iPromoter-2L, the first model for identifying promoters and their types. The accuracy of the first layer was 81.68%, and the accuracy of the second layer was between 80% and 95%. In 2019, Xiao et al. [3] built a two-layer model, iPSW(2L)-PseKNC, to identify promoters and their strength. On the benchmark dataset, the accuracy of Layer-1 was 83.13% and that of Layer-2 was 71.20%. In the same year, Le et al. [13] classified promoters by deep learning and combination of continuous FastText N-Grams (DL-CFTNG), achieving accuracies of 85.41% and 73.1% in the two layers. Recently, Liang et al. developed a new model called iPromoter-ET [14], which studies promoters using multiple feature extraction methods and extremely randomized trees for selecting optimal features. The accuracy of the first layer of iPromoter-ET reached 85.14%, and the accuracy of the second layer was 72.59%. In addition, Table 1 lists the datasets used by the above models, with hyperlinks to the data sources.

Researchers have made many contributions to promoter identification, and prediction accuracy has improved, but some problems remain. Few models can identify both promoters and their strength; most can identify promoters but not their types. In addition, most models, with the exception of [3], lack independent datasets for testing. It is therefore important to develop a convincing model that predicts promoters and their strength more accurately.

Deep learning, proposed by Hinton et al. [15], is a field of machine learning research inspired by artificial neural networks. A deep network combines low-level features to form more abstract high-level features, discovers distributed feature representations of data, and simulates the way the human brain analyzes and learns. Deep learning algorithms can be divided into three categories: convolutional neural networks, recurrent neural networks and generative adversarial networks. They have been successfully applied in computer vision, speech recognition, natural language processing and other fields [16], [17], [18], [19], [20], [21]. In recent years, the use of deep learning for biological sequence recognition has achieved satisfactory results [22], [23], [24], [25], [26].

Based on previous studies of promoters and generative adversarial networks, we establish a novel two-layer model, named iPro-GAN, to identify promoters and their strength. It uses Moran-based spatial auto-cross correlation for feature extraction and a deep convolutional generative adversarial network for classification. Most importantly, the model has achieved the best results so far. Fig. 1 shows how iPro-GAN works.
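The two-layer decision flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `layer1` and `layer2` are hypothetical callables standing in for the trained classifiers, the 0.5 threshold is an assumption, and the toy scoring rules are placeholders rather than real promoter predictors.

```python
from typing import Callable

# Hypothetical type: a classifier maps a DNA sequence to a score in [0, 1].
Classifier = Callable[[str], float]


def predict_two_layer(seq: str, layer1: Classifier, layer2: Classifier,
                      threshold: float = 0.5) -> str:
    """Cascade: layer 1 decides promoter vs. non-promoter;
    layer 2 runs only on predicted promoters and decides strong vs. weak."""
    if layer1(seq) < threshold:
        return "non-promoter"
    return "strong promoter" if layer2(seq) >= threshold else "weak promoter"


# Toy stand-ins (assumptions for illustration only, not biological rules).
def toy_layer1(seq: str) -> float:
    # Score a sequence by its fraction of A/T bases.
    return sum(base in "AT" for base in seq) / len(seq)


def toy_layer2(seq: str) -> float:
    # Score 1.0 if an arbitrary marker motif is present, else 0.0.
    return 1.0 if "TATAAT" in seq else 0.0
```

In this arrangement the second classifier never sees sequences rejected by the first, so errors made in layer 1 propagate to the final label.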

Section snippets

Dataset

The first step in developing a reliable prediction model is to construct a high-quality benchmark dataset. All the core promoter sequences are obtained from Xiao et al.’s study [3], which can be formulated as

$$S = S^{+} \cup S^{-}, \qquad S^{+} = S^{+}_{\mathrm{strong}} \cup S^{+}_{\mathrm{weak}}$$

where $S^{+}$ contains 3382 promoters, $S^{-}$ contains 3382 non-promoters, $S^{+}_{\mathrm{strong}}$ contains 1591 strong promoters, $S^{+}_{\mathrm{weak}}$ contains 1792 weak promoters, and $\cup$ denotes the set union operator. Those sample sequences with high similarity have been removed by

Optimal features

In feature extraction, Moran-based spatial auto-cross correlation (MSA) takes into account the physicochemical properties of dinucleotides and integrates global sequence-order information into our model. The dimension of the MSA feature vector increases rapidly with the parameter λ, whose value ranges from 1 to 10. As shown in Fig. 5, λ = 9 gives the highest accuracy for identifying promoters and non-promoters, while λ = 7 gives the best accuracy for identifying strong promoters and weak
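To illustrate how a lag parameter such as λ controls the feature dimension, the sketch below computes the standard Moran autocorrelation of a single dinucleotide property at lags 1 to λ. It is an assumption-laden illustration: the paper's exact MSA formula, its cross-correlation terms and its property table are not given in this excerpt, and the values in `TOY_PROPERTY` are made-up placeholders, not real physicochemical data.

```python
from itertools import product

# Placeholder values for the 16 dinucleotides (NOT real physicochemical data).
TOY_PROPERTY = {a + b: float(i)
                for i, (a, b) in enumerate(product("ACGT", repeat=2))}


def moran_autocorrelation(seq, prop, lag):
    """Standard Moran autocorrelation of one dinucleotide property at one lag."""
    # Map each overlapping dinucleotide to its property value.
    values = [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < number of dinucleotides")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0.0:
        return 0.0  # constant profile: define the statistic as 0
    covariance = sum((values[i] - mean) * (values[i + lag] - mean)
                     for i in range(n - lag)) / (n - lag)
    return covariance / variance


def moran_features(seq, prop, max_lag):
    """One feature per lag 1..max_lag, so the vector grows with max_lag."""
    return [moran_autocorrelation(seq, prop, lag)
            for lag in range(1, max_lag + 1)]
```

With several properties, and with cross terms between pairs of properties, the number of features per lag multiplies accordingly, which is consistent with the rapid growth in dimension noted above.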

Discussion

To find differences in sequence pattern between promoters and non-promoters, as well as between strong and weak promoters, Two Sample Logos [51,65] with an independent t-test (p-value < 0.05) is applied to visualize the sequences. Fig. 10(I) shows that, compared with non-promoters, most positions along promoter sequences are enriched only in T and A and depleted only in G and C. For strong promoters and weak promoters, as shown in Fig. 10(II), it can

Conclusion

In this study, a new model named iPro-GAN is established to identify promoters and their strength. Moran-based spatial auto-cross correlation, which considers the physicochemical properties of dinucleotides, is used to extract sequence features. Then, a deep convolutional generative adversarial network is built to predict promoters and their strength. Afterwards, four metrics (accuracy, sensitivity, specificity and Matthews correlation coefficient) and the area under Receiver Operating
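The threshold-based metrics named in this section can be computed from a binary confusion matrix as sketched below. This assumes the paper uses the standard definitions; the function name `binary_metrics` and the counts in the usage note are hypothetical and are not results from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient
    from confusion-matrix counts (standard definitions)."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}
```

For example, with made-up counts of 90 true positives, 80 true negatives, 20 false positives and 10 false negatives, the function gives an accuracy of 0.85, a sensitivity of 0.90 and a specificity of 0.80.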

Ethical approval

No ethics approval is required.

Declaration of Competing Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 12101480), the Natural Science Basic Research Program of Shaanxi (No. 2021JM-115), and the Fundamental Research Funds for the Central Universities (No. JB210715).

References (65)

  • Z. Liu et al. iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem. (2015)
  • S. Zhang et al. KD-KLNMF: identification of lncRNAs subcellular localization with multiple features and nonnegative matrix factorization. Anal. Biochem. (2020)
  • S. Zhang et al. iR5hmcSC: identifying RNA 5-hydroxymethylcytosine with multiple features based on stacking learning. Comput. Biol. Chem. (2021)
  • J. Jia et al. pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J. Theor. Biol. (2016)
  • B. Yu et al. DNNAce: prediction of prokaryote lysine acetylation sites through deep neural networks with multi-information fusion. Chemometr. Intell. Lab. (2020)
  • I.A. Shahmuradov et al. bTSSfinder: a novel tool for the prediction of promoters in Cyanobacteria and Escherichia coli. Bioinformatics (2016)
  • B. Liu et al. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics (2018)
  • X. Xiao et al. iPSW(2L)-PseKNC: a two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics (2019)
  • G.C. Socorro et al. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res. (2016)
  • K. Florquin et al. Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. (2005)
  • S. Burden et al. Improving promoter prediction for the NNPP2.2 algorithm: a case study using Escherichia coli DNA sequences. Bioinformatics (2005)
  • V. Rangannan et al. Relative stability of DNA as a generic criterion for promoter prediction: whole genome annotation of microbial genomes with varying nucleotide base composition. Mol. BioSyst. (2009)
  • K. Song. Recognition of prokaryotic promoters based on a novel variable-window Z-curve method. Nucleic Acids Res. (2012)
  • S. Silva et al. DNA duplex stability as discriminative characteristic for Escherichia coli σ54- and σ28-dependent promoter sequences. Biol. J. Int. Assoc. Biol. Stand. (2014)
  • H. Lin et al. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. (2014)
  • N.Q.K. Le et al. Classifying promoters by interpreting the hidden information of DNA sequences via deep learning and combination of continuous FastText N-Grams. Front. Bioeng. Biotech. (2019)
  • G. Hinton et al. A fast learning algorithm for deep belief nets. Neural Comput. (2006)
  • Y. Xie et al. DG-Font: deformable generative networks for unsupervised font generation.
  • M. Wray et al. On semantic similarity in video retrieval.
  • Y. Xue et al. A self-adaptive mutation neural architecture search algorithm based on blocks. IEEE Comput. Intell. M. (2021)
  • Y. Xue et al. A multiobjective evolutionary approach based on graph-in-graph for neural architecture search of convolutional neural networks. Int. J. Neural Syst. (2021)
  • D. O'Neill et al. Evolutionary neural architecture search for high-dimensional skip-connection structures on DenseNet style networks. IEEE T. Evolut. Comput. (2021)