Predicting gene expression level by the transcription factor binding signals in human embryonic stem cells
Introduction
Embryonic stem cells (ESCs) derived from blastocysts are self-renewing and pluripotent (Evans and Kaufman, 1981, Hwang et al., 2004). Hence, understanding the gene regulatory system in ESCs is an important step toward advancing regenerative medicine. Earlier studies showed that gene expression in eukaryotic organisms is controlled by multiple factors (Berger, 2007, Farnham, 2009). Among these factors, transcription factors (TFs) play a crucial role: TFs can activate or suppress the initiation of gene transcription by binding to specific DNA sequences in promoters or enhancers (Budden et al., 2014, Shlyueva et al., 2014). They can also regulate gene expression by recruiting chromatin-modifying enzymes that induce changes in chromatin structure (Berger, 2007, Zhang et al., 2014). In higher eukaryotic organisms, several TFs can act cooperatively to form a complex regulatory pattern that precisely regulates gene expression levels (Pougach et al., 2014, Zhang et al., 2014).
In earlier studies, to better understand the relationship between transcription factors and gene expression, predictive models were constructed in which gene expression levels were treated as response variables and various TF-related features, such as the motifs recognized by the TFs (Bussemaker et al., 2001), transcription factor binding sites (Yuan et al., 2007) and motif scores based on position-specific weight matrices (Conlon et al., 2003), were selected as predictor variables. Among the large number of statistical models, many were based on linear regression. For example, Sun et al. (2006) proposed a Bayesian error analysis model that integrates protein-DNA binding data and gene expression data and explicitly considers measurement errors; Boulesteix and Strimmer (2005) developed a statistical approach based on partial least squares regression to infer transcription factor activities from a combination of mRNA expression and DNA-protein binding measurements; and Das et al. (2006) presented a method based on multivariate adaptive regression splines that uncovered several human transcriptional subnetworks. There were also extensions: for instance, Ouyang et al. (2009), Cheng and Gerstein (2012) and Park and Nakai (2011) investigated the relation between transcription factors and gene expression levels using approaches that build on linear regression. However, although extensive efforts have been made, many approaches were developed for mouse or focused on only a few TFs.
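As a rough illustration of the regression framing used in this line of work (a generic form, not the specific model of any cited study), the expression level of a gene can be written as a linear function of its TF-related features. The symbols below are notation assumed here for illustration only.

```latex
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \dots, n
```

Here y_i denotes the expression level of gene i, x_{ij} the value of TF-related feature j for that gene, \beta_j the fitted coefficient of feature j, and \varepsilon_i the residual error.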
In this paper, we investigate the relationship between 57 TFs and gene expression levels in human embryonic stem cells and construct a model for predicting gene expression levels. Based on the model, we predict the up-regulated and down-regulated genes of each TF by using the transcription factor association strength (TFAS) and use a one-sided Kolmogorov-Smirnov test to verify the statistical significance. The results show that many experimentally confirmed TF targets are consistent with our conclusions, with P < 2.2 × 10^-16 for both up-regulated and down-regulated genes. In addition, we derive an “optimal” reduced model by stepwise regression analysis and apply it to predict the expression levels of genes with high CpG content promoters (HCPs) and low CpG content promoters (LCPs). The results show that our approach not only achieves a better predictive effect but also indicates that the expression of HCP genes and LCP genes may be regulated by different mechanisms.
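The one-sided Kolmogorov-Smirnov test mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ks_one_sided`, the choice of comparing a predicted target group against a background group, and the use of the large-sample approximation for the P value are all assumptions made here.

```python
import math


def ks_one_sided(sample_a, sample_b):
    """One-sided two-sample KS test (hypothetical helper, for illustration).

    Alternative hypothesis: values in sample_a tend to be larger than
    values in sample_b, i.e. the empirical CDF of sample_a lies below
    that of sample_b. Returns (D, p), where p uses the asymptotic
    approximation exp(-2 * D^2 * m*n / (m + n)), which is only
    reasonable for large samples.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    m, n = len(a), len(b)
    d = 0.0
    i = j = 0
    # Walk over every observed value and track both empirical CDFs.
    for x in sorted(set(a) | set(b)):
        while i < m and a[i] <= x:
            i += 1
        while j < n and b[j] <= x:
            j += 1
        d = max(d, j / n - i / m)  # F_b(x) - F_a(x)
    p = math.exp(-2.0 * d * d * m * n / (m + n))
    return d, p


# Toy example: expression of an assumed "up-regulated" group vs. a background.
up_regulated = [5.0, 6.0, 7.0, 8.0]
background = [1.0, 2.0, 3.0, 4.0]
d_stat, p_value = ks_one_sided(up_regulated, background)
```

With real data, the two inputs would be the expression levels of the predicted target genes and of the comparison genes; for small groups an exact P value would be preferable to the approximation used in this sketch.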
ChIP-seq and gene expression data
The RefSeq genes of the human genome (hg19, February 2009) are downloaded from the UCSC database (http://genome.ucsc.edu/); the annotation contains the gene names, names, chromosomes, strands, transcription starts, transcription ends, coding region starts, coding region ends, exon counts, exon starts, and exon ends. Genes whose identifiers begin with NM (mature messenger RNA) are selected. In order to avoid the possibility that some of the RefSeq genes are actually the alternative transcripts of the same
TF binding signal and gene expression
For each TF, we calculate its binding signal in each of the 100 bins for all RefSeq genes by Eq. (1); the signal is then averaged across all genes to obtain the signal profile of the TF over the 100 bins. To further investigate the relationship between TF binding position and gene expression level, the correlation between the TF binding signal in each bin and the gene expression level is estimated using the Spearman correlation coefficient.
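The per-bin averaging and correlation steps can be sketched as below. This is an illustrative sketch only: it assumes the binned signal from Eq. (1) has already been computed and is available as one list of values per gene, and the function names and toy data are hypothetical (three bins are used instead of 100 to keep the example short).

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for a constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bin_profile_and_correlation(signal, expression):
    """signal: one list of per-bin values per gene; expression: one value per gene."""
    n_bins = len(signal[0])
    columns = [[gene[b] for gene in signal] for b in range(n_bins)]
    profile = [mean(col) for col in columns]          # average signal per bin
    correlation = [spearman(col, expression) for col in columns]
    return profile, correlation


# Toy data: 4 genes x 3 bins (assumed values, not from the paper).
toy_signal = [[0.1, 2.0, 0.5], [0.2, 3.0, 0.4], [0.3, 5.0, 0.1], [0.4, 9.0, 0.3]]
toy_expression = [1.0, 2.0, 4.0, 8.0]
profile, correlation = bin_profile_and_correlation(toy_signal, toy_expression)
```

Plotting `profile` and `correlation` against bin index would give one signal curve and one correlation curve per TF; bins with a constant signal across genes would need to be handled separately, since the correlation is undefined there.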
The signal distribution (black) and correlation (red) are
Conclusions
In this study, using ChIP-seq data for 57 transcription factors and mRNA-seq data from the human H1 cell line, we systematically investigate the relationship between transcription factor binding and gene expression levels in the reference genome and compare the signal distributions of highly expressed and lowly expressed genes. Our results quantify the relationship between transcription factor binding and gene expression levels and show that the DNA regions with stronger
Acknowledgment
This work is supported by a grant from the National Natural Science Foundation of China (No. 31460234).
References (41)
- et al. Mixtures of robust probabilistic principal component analyzers. Neurocomputing (2008)
- et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell (2005)
- et al. REST: a mammalian silencer protein that restricts sodium channel gene expression to neurons. Cell (1995)
- et al. JAZF1/JJAZ1 gene fusion in endometrial stromal sarcomas: molecular analysis by reverse transcriptase-polymerase chain reaction optimized for paraffin-embedded tissue. J. Mol. Cell. Cardiol. (2005)
- et al. Transcriptional regulation of nanog by OCT4 and SOX2. J. Biol. Chem. (2005)
- Introduction to principal components analysis. PM&R (2014)
- et al. Amplification of the ch19p13.2 NACC1 locus in ovarian high-grade serous carcinoma. Modern. Pathol. (2011)
- et al. The transcription factor Myc controls metabolic reprogramming upon T lymphocyte activation. Immunity (2011)
- The complex language of chromatin regulation during transcription. Nature (2007)
- et al. Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor. Biol. Med. Model. (2005)
- Predicting expression: the complementary power of histone modification and transcription factor binding data. Epigenetics. Chromatin.
- Regulatory element detection using correlation with expression. Nat. Genet.
- Role of histone H3 lysine 27 methylation in polycomb-group silencing. Science
- Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells. Nucleic Acids Res.
- Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. U. S. A.
- Tumour-specific arginine vasopressin promoter activation in small-cell lung cancer. Br. J. Cancer
- Adaptively inferring human transcriptional subnetworks. Mol. Syst. Biol.
- Establishment in culture of pluripotential cells from mouse embryos. Nature (London)
- Insights from genomic profiling of transcription factors. Nat. Rev. Genet.
- Multicollinearity in regression analysis: the problem revisited. Rev. Econ. Stat.