Elsevier

Biosystems

Volume 150, December 2016, Pages 92-98

Predicting gene expression level by the transcription factor binding signals in human embryonic stem cells

https://doi.org/10.1016/j.biosystems.2016.08.011

Highlights

  • The genome-wide distributions of the binding signals of 57 transcription factors are computed.

  • Transcription factor synthetic indexes (TFSIs) are defined by their association strength.

  • A statistical model for predicting gene expression level is built from the 57 TFSIs.

  • The Up-regulated and Down-regulated genes of the 57 transcription factors are predicted.

  • Eight TFSIs that are vital for predicting gene expression are selected.

Abstract

Transcription factor (TF) binding signals play an important role in the control of gene expression. In this study, to elucidate the relationship between TF binding signals and gene expression, we first analyze the distributions of the binding signals of 57 TFs in human H1 embryonic stem cells and then compare their distributions in highly and lowly expressed genes. On this basis, a statistical model for predicting gene expression level is constructed from 57 transcription factor synthetic indexes (TFSIs). The Down-regulated and Up-regulated genes of each TF are then predicted, and their statistical significance is estimated with a one-sided Kolmogorov-Smirnov test. Using stepwise regression analysis, the “optimal” TFSIs are selected, and better results are obtained for predicting the expression level of genes with high CpG content promoters (HCPs) and low CpG content promoters (LCPs).

Introduction

Embryonic stem cells (ESCs) derived from blastocysts are self-renewing and pluripotent (Evans and Kaufman, 1981, Hwang et al., 2004). Hence, understanding the gene regulatory system in ESCs is an important step toward advancing regenerative medicine. Earlier studies showed that gene expression in eukaryotic organisms is controlled by multiple factors (Berger, 2007, Farnham, 2009). Among these factors, transcription factors (TFs) play a crucial role: TFs can activate or suppress the initiation of gene transcription by binding to specific DNA sequences in promoters or enhancers (Budden et al., 2014, Shlyueva et al., 2014). They can also regulate gene expression by recruiting chromatin-modifying enzymes that induce changes in chromatin structure (Berger, 2007, Zhang et al., 2014). In higher eukaryotes, several TFs can act cooperatively to form a complex regulatory pattern that precisely regulates gene expression levels (Pougach et al., 2014, Zhang et al., 2014).

In earlier studies, to better understand the relationships between transcription factors and gene expression, predictive models were constructed in which gene expression levels were treated as response variables and various TF-related features, such as the motifs recognized by the TFs (Bussemaker et al., 2001), transcription factor binding sites (Yuan et al., 2007) and motif scores based on position-specific weight matrices (Conlon et al., 2003), were selected as explanatory variables. Among the large number of statistical models, many were based on nonlinear regression. For example, Sun et al. (2006) proposed a Bayesian error analysis model that integrates protein-DNA binding data and gene expression data and explicitly considers measurement errors; Boulesteix and Strimmer (2005) developed a statistical approach based on partial least squares regression to infer transcription factor activities from a combination of mRNA expression and DNA-protein binding measurements; and Das et al. (2006) presented multivariate adaptive regression splines, which uncovered several human transcriptional subnetworks. There were also extensions: for instance, Ouyang et al. (2009), Cheng and Gerstein (2012) and Park and Nakai (2011) investigated the relation between transcription factors and gene expression level by incorporating linear regression. However, despite these extensive efforts, many approaches were developed for mouse or focused on only a few TFs.

In this paper, we investigate the relationship between 57 TFs and gene expression level in human embryonic stem cells and construct a model for predicting gene expression level. Based on the model, we predict the Up-regulated and Down-regulated genes of each TF by using the transcription factor association strength (TFAS) and use a one-sided Kolmogorov-Smirnov test to verify the statistical significance. The results show that many experimentally confirmed TF targets are consistent with our conclusions, with P < 2.2 × 10^−16 for both Up-regulated and Down-regulated genes. In addition, we derive a reduced “optimal” model by stepwise regression analysis and apply it to predict the expression level of genes with high CpG content promoters (HCPs) and low CpG content promoters (LCPs). The results show that our approach not only achieves a better predictive effect but also indicates that the expression of HCP genes and LCP genes may be regulated by different mechanisms.
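As a minimal sketch of the significance check mentioned above, a one-sided two-sample Kolmogorov-Smirnov test can be run with SciPy's ks_2samp. The expression values below are synthetic, and comparing a set of predicted Up-regulated targets against the remaining genes is an assumption made for illustration; this section does not state exactly which distributions the authors compare.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic log-expression values (illustrative only, not data from the study).
background = rng.normal(loc=0.0, scale=1.0, size=2000)  # hypothetical non-target genes
up_targets = rng.normal(loc=1.0, scale=1.0, size=300)   # hypothetical Up-regulated targets

# alternative="greater" asks whether the empirical CDF of the first sample lies
# above that of the second somewhere, i.e. whether the first sample (background)
# tends to take smaller values than the second (targets).
result = ks_2samp(background, up_targets, alternative="greater")
statistic, p_value = result.statistic, result.pvalue

print(f"D+ = {statistic:.3f}, one-sided p = {p_value:.3g}")
```

For a set of predicted Down-regulated targets, the same call with the two samples swapped would test the opposite direction.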

Section snippets

ChIP-seq and gene expression data

The RefSeq genes of the human genome (hg19, February 2009) are downloaded from the UCSC database (http://genome.ucsc.edu/); each record contains the gene names, names, chromosomes, strands, transcription starts, transcription ends, coding region starts, coding region ends, exon counts, exon starts, and exon ends. Genes whose identifiers begin with NM (mature messenger RNA) are selected. In order to avoid the possibility that some of the RefSeq genes are actually the alternative transcripts of the same

TF binding signal and gene expression

For each TF, we calculate its binding signal in each of the 100 bins for all RefSeq genes by Eq. (1) and then average it across all genes to obtain the signal profile of the TF over the 100 bins. To further investigate the relationship between TF binding position and gene expression level, the correlation between the TF binding signal in each bin and gene expression level is estimated with the Spearman correlation coefficient.
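The per-bin computation described above can be sketched as follows. This is an illustration under stated assumptions: Eq. (1) is not reproduced in this excerpt, so the signal matrix is synthetic, and the function name bin_profile_and_correlation and the planted signal in bin 50 are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def bin_profile_and_correlation(signal, expression):
    """Return the mean signal per bin and the per-bin Spearman correlation.

    signal:     array of shape (n_genes, n_bins), one TF's binding signal
    expression: array of shape (n_genes,), expression level of each gene
    """
    # Average over genes to get the signal profile across bins.
    profile = signal.mean(axis=0)
    # Spearman correlation between each bin's signal and gene expression.
    rho = np.empty(signal.shape[1])
    for b in range(signal.shape[1]):
        rho[b], _ = spearmanr(signal[:, b], expression)
    return profile, rho


# Synthetic stand-in data: 500 genes, 100 bins (not data from the study).
rng = np.random.default_rng(1)
expression = rng.normal(size=500)
signal = rng.normal(size=(500, 100))
signal[:, 50] += 2.0 * expression  # plant an expression-linked signal in bin 50

profile, rho = bin_profile_and_correlation(signal, expression)
print("bin with strongest correlation:", int(np.argmax(rho)))
```

Plotting profile and rho against the bin index would give one curve for the signal distribution and one for the correlation.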

The signal distribution (black) and correlation (red) are

Conclusions

In this study, using ChIP-seq data for 57 transcription factors and mRNA-seq data from the human H1 cell line, we systematically study the relations between transcription factor binding and gene expression level in the reference genome and compare the differences in signal distribution between highly expressed and lowly expressed genes. Our results quantify the relations between transcription factors and gene expression level and show that the DNA regions with stronger

Acknowledgment

This work is supported by a grant from the National Natural Science Foundation of China (No. 31460234).

References (41)

  • D.M. Budden et al.

    Predicting expression: the complementary power of histone modification and transcription factor binding data

    Epigenetics Chromatin

    (2014)
  • H.J. Bussemaker et al.

    Regulatory element detection using correlation with expression

    Nat. Genet.

    (2001)
  • R. Cao et al.

    Role of histone H3 lysine 27 methylation in polycomb-group silencing

    Science

    (2002)
  • C. Cheng et al.

    Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells

    Nucleic Acids Res.

    (2012)
  • E.M. Conlon et al.

    Integrating regulatory motif discovery and genome-wide expression analysis

    Proc. Natl. Acad. Sci. U. S. A.

    (2003)
  • J.M. Coulson et al.

    Tumour-specific arginine vasopressin promoter activation in small-cell lung cancer

    Br. J. Cancer

    (1999)
  • D. Das et al.

    Adaptively inferring human transcriptional subnetworks

    Mol. Syst. Biol.

    (2006)
  • M.J. Evans et al.

    Establishment in culture of pluripotential cells from mouse embryos

    Nature (London)

    (1981)
  • P.J. Farnham

    Insights from genomic profiling of transcription factors

    Nat. Rev. Genet.

    (2009)
  • D.E. Farrar et al.

    Multicollinearity in regression analysis: the problem revisited

    Rev. Econ. Stat.

    (1967)

Cited by (11)

    • Pan-cancer identification of the relationship of metabolism-related differentially expressed transcription regulation with non-differentially expressed target genes via a gated recurrent unit network

      2022, Computers in Biology and Medicine
      Citation Excerpt :

      Beer et al. predicted gene expression levels using the Bayesian probabilistic framework to extract the sequence features for expression patterns, and demonstrated satisfactory prediction performance of the gene expression levels of Caenorhabditis elegans [17]. Multiple regression algorithms including linear regression, multivariate linear regression and support vector regression (SVR) have been used to predict gene expression levels based on TF-binding signals from ChIP-seq and chromatin data [16,18–20]. The data of other dynamic patterns, such as TF-binding alternations [21,22] and TF-binding scores, have also been successfully used to predict gene expression [23].

    • The impact of gene-body H3K36me3 patterns on gene expression level changes in chronic myelogenous leukemia

      2021, Gene
      Citation Excerpt :

      For each TF, we capture its motif sites by applying the “Homer” software with the following command: scanMotifGenomeWide.pl <motif file> <genome> [options] (Heinz et al., 2010). Next, we calculate the level of each TF in the body region of each gene using Eq. (1) in the supplemental files (Zhang et al., 2016, 2018; Zhang and Li, 2017). Then, 17 vital TFs for predicting gene expression level changes are identified by performing stepwise regression analysis (see Supplementary information).

    • MACMIC Reveals A Dual Role of CTCF in Epigenetic Regulation of Cell Identity Genes

      2021, Genomics, Proteomics and Bioinformatics
      Citation Excerpt :

      In other words, highly correlated features often contain redundant information [8]. For example, whereas the dozens of pluripotent factors such as Oct4, Sox2, Klf4, and c-Myc, are all useful to predict genes expressed in stem cells [9–11], combining some pluripotent factors with endothelial lineage factors such as Lmo2 and Erg would add power to also predict genes expressed in endothelial cells; therefore, it can be more powerful using combined information from transcription factors with distinct functions, as opposed to an analysis using the transcription factors with similar effects on a shared set of target genes. More importantly, colocalization of low-correlation chromatin features may still happen in a biologically meaningful manner to implement important functions.

    • Genome-wide analysis of H3K36me3 and its regulations to cancer-related genes expression in human cell lines

      2018, BioSystems
      Citation Excerpt :

      Epigenetics has become a hot topic in recent years (Brusslan et al., 2015; Gyorffy et al., 2016; Lawrence et al., 2016; Radaiglesias, 2018; Zhang et al., 2016).
