DAE-TPGM: A deep autoencoder network based on a two-part-gamma model for analyzing single-cell RNA-seq data

https://doi.org/10.1016/j.compbiomed.2022.105578

Highlights

  • The TPGM model captures the bimodal expression pattern and the right-skewed distribution of normalized scRNA-seq data.

  • DAE-TPGM effectively performs imputation and dimensionality reduction of scRNA-seq data simultaneously.

  • DAE-TPGM achieves improved imputation performance on scRNA-seq data compared with other imputation methods.

Abstract

Single-cell RNA sequencing (scRNA-seq) can reveal differences in genetic material at the single-cell level and is widely used in biomedical studies. However, the minute RNA content within individual cells often results in a high number of dropouts and introduces random noise into scRNA-seq data, concealing the original gene expression pattern. Data normalization is therefore a critical step in the analysis pipeline: it adjusts for unexpected biological and technical effects, and it yields semi-continuous normalized data that exhibit a characteristic bimodal expression pattern. We further find that the positive continuous expression values follow a right-skewed distribution, a property that remains under-explored by mainstream dimensionality reduction and imputation methods. We introduce a deep autoencoder network based on a two-part-gamma model (DAE-TPGM) for joint dimensionality reduction and imputation of scRNA-seq data. DAE-TPGM uses a two-part-gamma model to capture the statistical characteristics of semi-continuous normalized data, and uses a deep autoencoder to adaptively explore potential relationships between genes and thereby improve imputation. As in classic applications of autoencoders to dimensionality reduction, our customized autoencoder better captures phenotypic information in peripheral blood mononuclear cells (PBMC) and clearly infers continuous phenotype information for hematopoiesis in mice. Compared with mainstream imputation methods such as MAGIC, SAVER, scImpute and DCA, the new model achieved substantial improvement in the recognition of cellular phenotypes in two real datasets, and comprehensive analyses on synthetic “ground truth” data demonstrated that our method has competitive advantages over other imputation methods in discovering underlying gene expression patterns in time-course data.

Introduction

Cell heterogeneity describes the basic characteristics of organisms [14]. The rapid development of single-cell RNA sequencing (scRNA-seq) in recent years has facilitated research on genetic material at single-cell resolution and has increased the throughput of long-span sequencing without increasing its cost [42]. Unlike traditional bulk sequencing techniques, which are commonly used to quantify expression levels from pooled populations of cells [28,35], scRNA-seq provides a powerful tool for identifying different single-cell subpopulations [10,16] and characterizing cell differentiation [9]. This technique also facilitates the reconstruction of lineage branching [11] and medical applications in early embryonic tissue development [38,39], organ development [25,31], and cancer research [6,22].

Despite the rapid development of scRNA-seq technologies, one challenge is that each cell contains only approximately 10 pg of total RNA on average, of which only about 0.1 pg is messenger RNA (mRNA). Therefore, after the reverse transcription of mRNA to complementary DNA (cDNA), a powerful amplification step, often by polymerase chain reaction (PCR), is required to generate the concentration of cDNA needed for sequencing. Unfortunately, cDNA amplification is never perfectly linear, resulting in an inaccurate representation of the cDNAs present in a cell [26]. This bias increases exponentially with the number of amplification cycles, leading to variations between cells (e.g., in library size), variations within cells (e.g., in guanine-cytosine (GC) content), and variations in gene sequence length and batch effects [18,21,24]. As a result, data normalization is usually conducted during the preprocessing of scRNA-seq data to reduce measurement bias and to adjust for the unexpected biological and technical effects that can mask the signal of interest [5].
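To make the notion of normalized, semi-continuous data concrete, the following minimal sketch applies one common normalization: library-size scaling followed by a log transform. The choice of this particular normalization, the function name `log_normalize`, and the scale factor of 10,000 are assumptions for illustration; the text above does not state which normalization was used.

```python
import numpy as np

def log_normalize(counts, scale=1e4):
    """Library-size normalization followed by log1p.

    counts: (cells, genes) array of raw non-negative counts.
    scale:  hypothetical target library size (an assumption, not from the paper).
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    # Guard against empty cells to avoid division by zero.
    lib_size = np.where(lib_size == 0, 1.0, lib_size)
    return np.log1p(counts / lib_size * scale)

# Toy count matrix: 3 cells x 3 genes.
counts = np.array([[0, 5, 15],
                   [2, 0, 0],
                   [0, 0, 40]])
norm = log_normalize(counts)
```

Zeros stay exactly zero after this transform while positive counts become continuous values, which is the semi-continuous structure that a two-part model is meant to describe.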

Moreover, dropout events are another typical problem of scRNA-seq data, arising from the stochastic nature of gene expression [8]: a gene is observed at a moderate or high expression level in some cells but is not detected in others [15]. The limited concentration of mRNA in some cells makes it more likely that a transcript will be missed during reverse transcription (RT), so that the gene is not captured during sequencing. Dropout events introduce random noise into the biological signal and conceal the original gene expression pattern, hindering follow-up research. Several imputation methods have been proposed to counteract this problem and correct the false missing values in scRNA-seq data.

At present, most scRNA-seq data imputation methods are based on statistical models. SAVER is a Bayesian framework based on adaptive shrinkage to a multi-gene prediction. It recovers the true expression level of each gene in each cell by removing technical variation while retaining biological variation across cells [13]. DCA learns the underlying true zero-noise data manifold to denoise scRNA-seq data via an autoencoder framework [7]. To account for zero-inflated single-cell gene expression, DCA uses the negative log-likelihood of a zero-inflated negative binomial (ZINB) model as its loss, and takes the estimated mean parameters of the NB component as the imputed gene expression profile. Both of these methods model count data, whereas scImpute models normalized data. scImpute does not treat all zero counts as missing values, since some of them may reflect true biological non-expression [19]. Accordingly, scImpute estimates the dropout probabilities of all genes using a Gamma-Normal mixture model, and it imputes only those values whose dropout probability is greater than or equal to a specified threshold. MAGIC assumes that cell phenotypes can be modeled as a low-dimensional manifold embedded within the high-dimensional measurement space [32]. Graph diffusion and its associated diffusion operator are used to simultaneously discover the manifold and impute gene expression based on the manifold structure.
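The diffusion idea behind MAGIC can be illustrated with a deliberately simplified sketch: build a cell-by-cell affinity matrix, row-normalize it into a Markov operator, raise it to a power, and use it to average expression over neighboring cells. This is not the published MAGIC implementation, which includes further steps (for example, an adaptive nearest-neighbor kernel); the function name `diffusion_impute`, the fixed Gaussian bandwidth `sigma`, and the diffusion time `t` are assumptions made for this example.

```python
import numpy as np

def diffusion_impute(X, sigma=1.0, t=3):
    """Smooth a (cells, genes) matrix by diffusing over a cell-cell graph."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between cells.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # Gaussian affinities, then row-normalize to a Markov (row-stochastic) matrix.
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    M = A / A.sum(axis=1, keepdims=True)
    # t steps of diffusion; each imputed cell is a weighted average of all cells.
    return np.linalg.matrix_power(M, t) @ X

rng = np.random.default_rng(0)
X_demo = rng.gamma(2.0, 1.0, size=(6, 4))
X_demo[rng.random(X_demo.shape) < 0.3] = 0.0   # artificial zeros
X_smooth = diffusion_impute(X_demo, sigma=2.0, t=3)
```

Because every row of the operator sums to one, each imputed value is a convex combination of observed values for the same gene, so zeros are filled in from similar cells while the overall range of each gene is preserved.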

Although these methods have improved scRNA-seq data analysis through imputation, they still have some limitations. Single-cell gene expression presents a typical bimodal expression pattern, wherein abundant genes are either highly expressed or unexpressed within individual cells [8,33]. For instance, Finak et al. proposed a two-part generalized regression model for scRNA-seq data that uses a logistic regression model for the gene expression rate and a Gaussian linear model for the positive expression mean [8]. However, imputation approaches that model the bimodal expression pattern of normalized scRNA-seq data are still under-explored. In addition, our preliminary analyses of gene expression in cancer-related fibroblasts (CAF) in breast tumors (see Section 2.1) show that the positive continuous gene expression is right-skewed, which is another distributional characteristic of scRNA-seq data.

To fill the above gaps, we developed DAE-TPGM, a novel model that performs imputation and dimensionality reduction of scRNA-seq data simultaneously, based on a two-part-gamma model (TPGM) that can describe bimodal, semi-continuous normalized scRNA-seq data. Specifically, a deep autoencoder (DAE) infers the parameters of the TPGM by minimizing its negative log-likelihood. In general, our main contributions include the following three aspects:

  • We introduced a TPGM for fitting the gene expression of scRNA-seq data that simultaneously captures the rate of expression over the background of various transcripts and the right-skewed distribution of positive gene expression.

  • We developed a customized deep learning model, DAE-TPGM, for jointly learning scRNA-seq data imputation and dimensionality reduction. It uses a deep autoencoder to automatically capture gene-gene dependencies and high-level, complex, non-linear features in scRNA-seq data. DAE-TPGM not only accounts for the bimodal expression pattern of normalized data during imputation, but also adaptively explores potential relationships between genes to improve imputation by neural networks.

  • Extensive experiments were performed on real datasets, and empirical analyses demonstrated that DAE-TPGM improves various analyses of scRNA-seq data. Specifically, compared with mainstream scRNA-seq imputation methods, our method showed competitive advantages across many metrics.
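The contributions above describe an autoencoder whose outputs parameterize a two-part gamma likelihood. A minimal sketch of such a loss follows, assuming a standard two-part form: a value is zero with probability 1 - π, and with probability π it is drawn from a gamma distribution with shape α and rate β. This parameterization, the function names `tpgm_nll` and `decoder_heads`, and the sigmoid/softplus output activations are assumptions for illustration and may differ from the authors' implementation.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma, otypes=[float])

def tpgm_nll(x, pi, alpha, beta, eps=1e-10):
    """Mean negative log-likelihood of an assumed two-part gamma model.

    x:     non-negative expression values (zeros mean "not expressed")
    pi:    probability that a value is positive
    alpha: gamma shape (> 0); beta: gamma rate (> 0)
    """
    x, pi, alpha, beta = np.broadcast_arrays(
        *(np.asarray(a, dtype=float) for a in (x, pi, alpha, beta)))
    pi = np.clip(pi, eps, 1.0 - eps)
    positive = x > 0
    # Placeholder value 1 where x == 0 keeps log(x) finite; those entries are masked out.
    safe_x = np.where(positive, x, 1.0)
    log_gamma_pdf = (alpha * np.log(beta) - _lgamma(alpha)
                     + (alpha - 1.0) * np.log(safe_x) - beta * safe_x)
    log_lik = np.where(positive, np.log(pi) + log_gamma_pdf, np.log1p(-pi))
    return -log_lik.mean()

def decoder_heads(h, W_pi, W_alpha, W_beta):
    """Hypothetical output layer mapping decoder activations to valid parameters."""
    pi = 1.0 / (1.0 + np.exp(-(h @ W_pi)))      # sigmoid keeps pi in (0, 1)
    alpha = np.logaddexp(0.0, h @ W_alpha)      # softplus keeps shape positive
    beta = np.logaddexp(0.0, h @ W_beta)        # softplus keeps rate positive
    return pi, alpha, beta

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                                   # 4 cells, 8 hidden units
W = [rng.normal(scale=0.1, size=(8, 5)) for _ in range(3)]    # 5 genes
x = rng.gamma(2.0, 1.0, size=(4, 5)) * (rng.random((4, 5)) < 0.6)
loss = tpgm_nll(x, *decoder_heads(h, *W))
```

In a trainable version, this loss would be minimized with respect to the network weights by a framework with automatic differentiation; the NumPy code here only evaluates it.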

In the following sections of the paper, we first analyze the normalized scRNA-seq data from a statistical perspective to lay a foundation for selecting an appropriate probabilistic model. We next introduce the framework of the DAE-TPGM model and provide details of the techniques used to construct the network and loss function. We then evaluate the proposed model on real datasets from the two perspectives of dimensionality reduction and imputation. Finally, we compare the performance of our model with four mainstream imputation methods: MAGIC, SAVER, scImpute, and DCA.

Section snippets

The characteristics of scRNA-seq data

Unlike traditional bulk transcriptome data, scRNA-seq data contain excess zero counts. This phenomenon was confirmed in the real dataset from Bartoschek et al. [2]. This study identified heterogeneity of cancer-related fibroblasts (CAF) in breast cancer cells via unsupervised analysis of a single-cell transcriptome. The spatial and functional differences of breast cancer-related fibroblasts were studied using 768 fibroblast cells isolated from two 14-week-old mouse breast

Recovery of time-course patterns using a synthetic “ground truth” dataset

We first compared our method with other imputation methods on the recovery of time-course patterns by conducting experiments on a semi-real single-cell dataset. Because of the dropout phenomenon in scRNA-seq data, the true gene expression profile is usually unknown. To overcome this problem, we synthesized semi-real data using an approach similar to that used by MAGIC.
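The general idea of a semi-real benchmark can be sketched as follows: start from a matrix treated as ground truth, zero out entries at random to mimic dropout, and score an imputation by how well it agrees with the truth. The uniform dropout rate, the function names `simulate_dropout` and `recovery_correlation`, and the use of Pearson correlation as the score are assumptions for illustration; the exact synthesis procedure is not given in this excerpt.

```python
import numpy as np

def simulate_dropout(truth, rate=0.5, seed=0):
    """Return (observed, mask), where mask marks entries set to zero."""
    rng = np.random.default_rng(seed)
    truth = np.asarray(truth, dtype=float)
    mask = rng.random(truth.shape) < rate      # True where a value is dropped
    observed = np.where(mask, 0.0, truth)
    return observed, mask

def recovery_correlation(truth, imputed):
    """Pearson correlation between flattened truth and imputed matrices."""
    return np.corrcoef(np.ravel(truth), np.ravel(imputed))[0, 1]

# Toy time course: 50 time points x 5 genes with smooth, phase-shifted patterns.
time = np.linspace(0.0, 1.0, 50)
phases = np.linspace(0.0, np.pi, 5)
truth = 2.0 + np.sin(2.0 * np.pi * time[:, None] + phases[None, :])
observed, mask = simulate_dropout(truth, rate=0.6, seed=1)
baseline = recovery_correlation(truth, observed)   # score with no imputation
```

An imputation method would be applied to `observed`, and its output compared with `truth` in the same way; a useful method should score above the no-imputation baseline.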

The validation dataset was based on bulk transcriptomic data measured using microarrays from 206

Discussion

In this paper, we have presented an unsupervised learning approach for dimensionality reduction and imputation of scRNA-seq data, but it has some limitations and needs to be further improved. As an indispensable part of optimizing a deep learning model, hyperparameters such as the learning rate, the number of hidden units and layers, the mini-batch size and the regularization coefficient directly affect the degree of model optimization and the performance of the model. It is well known that the selection of

Conclusions

Dropout events caused by the minute RNA content of a single cell are one of the limitations of analyzing scRNA-seq data and ultimately hinder important biological discoveries. In this study, we combined the advantages of DAE and TPGM to create a new model for analyzing normalized scRNA-seq data. The advantage of this method is that it utilizes a two-part model to capture bimodal expression patterns of the normalized gene expression. The gamma distribution could effectively fit the

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62136004, 61802193), the National Key R&D Program of China (Grant Nos. 2018YFC2001600, 2018YFC2001602), the Natural Science Foundation of Jiangsu Province (BK20170934), and the Fundamental Research Funds for the Central Universities (NJ2020023).

Shuchang Zhao received the BSc degree from Suzhou University in 2013 and the MSc degree from the Anhui University of Science and Technology in 2016. Currently, he is working toward the PhD degree in the PARNEC group of the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China. His research interests include machine learning and bioinformatics.

References (42)

  • P. Dalerba et al.

    Single-cell dissection of transcriptional heterogeneity in human colon tumors

    Nat. Biotechnol.

    (2011)
  • G. Eraslan et al.

    Single-cell RNA-seq denoising using a deep count autoencoder

    Nat. Commun.

    (2019)
  • G. Finak et al.

    MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data

    Genome Biol.

    (2015)
  • M. Francesconi et al.

    The effects of genetic variation on gene expression dynamics during development

    Nature

    (2014)
  • M. Guo et al.

    SINCERA: a pipeline for single-cell RNA-seq profiling analysis

    PLoS Comput. Biol.

    (2015)
  • L. Haghverdi et al.

    Diffusion pseudotime robustly reconstructs lineage branching

    Nat. Methods

    (2016)
  • K. He et al.

    Deep residual learning for image recognition

  • M. Huang et al.

    Gene expression recovery for single cell RNA sequencing

    bioRxiv

    (2017)
  • T. Kalisky et al.

    Genomic analysis at the single-cell level

    Annu. Rev. Genet.

    (2011)
  • P.V. Kharchenko et al.

    Bayesian approach to single-cell differential expression analysis

    Nat. Methods

    (2014)
  • V.Y. Kiselev et al.

    SC3: consensus clustering of single-cell RNA-seq data

    Nat. Methods

    (2017)
Li Zhang received the BSc degree from the Changsha University of Science and Technology, in 2007, and the MSc and PhD degrees from the Nanjing University of Aeronautics and Astronautics (NUAA), in 2010 and 2015, respectively. He joined the College of Computer Science and Technology, Nanjing Forestry University, as a Lecturer, in 2016. His current research interests include machine learning and bioinformatics.

Xuejun Liu received the BSc and MSc degrees in computer science from Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, in 1999 and 2002, respectively, and the PhD degree in computer science from the University of Manchester, Manchester, UK, in 2006. Currently, she is a professor in the PARNEC group of the College of Computer Science and Technology at NUAA, Nanjing, China. Her research interests include machine learning and its practical applications, including bioinformatics.