Unsupervised Integration of Single-Cell Multi-omics Datasets with Disproportionate Cell-Type Representation

Demetçi, Pınar; Santorella, Rebecca; Sandstede, Björn; Singh, Ritambhara

doi:10.1007/978-3-031-04749-7_1

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13278)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

Integrated analysis of multi-omics data allows the study of how different molecular views in the genome interact to regulate cellular processes; however, with a few exceptions, applying multiple sequencing assays on the same single cell is not possible. While recent unsupervised algorithms align single-cell multi-omic datasets, these methods have been primarily benchmarked on co-assay experiments rather than the more common single-cell experiments taken from separately sampled cell populations. Therefore, most existing methods perform subpar alignments on such datasets. Here, we improve our previous work Single Cell alignment using Optimal Transport (SCOT) by using unbalanced optimal transport to handle disproportionate cell-type representation and differing sample sizes across single-cell measurements. We show that our proposed method, SCOTv2, consistently yields quality alignments on five real-world single-cell datasets with varying cell-type proportions and is computationally tractable. Additionally, we extend SCOTv2 to integrate multiple ($M\ge 2$) single-cell measurements and present a self-tuning heuristic process to select hyperparameters in the absence of any orthogonal correspondence information.

Available at: http://rsinghlab.github.io/SCOT.

Notes

1. Preprocessed data for the scGEM dataset accessed here: https://github.com/jw156605/MATCHER.
2. Dimensionality-reduced data, used by Pamona and us, here: https://github.com/caokai1073/Pamona/tree/master/scNMT. Preprocessing scripts for the raw data provided by the authors here: https://github.com/PMBio/scNMT-seq/.

References

1. Bonora, G., et al.: Single-cell landscape of nuclear configuration and gene expression during stem cell differentiation and X inactivation. Genome Biol. 22(1), 279 (2021). https://doi.org/10.1186/s13059-021-02432-w
2. Alvarez-Melis, D., Jaakkola, T.S.: Gromov-Wasserstein alignment of word embedding spaces. arXiv preprint arXiv:1809.00013 (2018)
3. Argelaguet, R., Clark, S.J., Mohammed, H., Stapel, L.C., Krueger, C., Kapourani, C.A., et al.: Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 576(7787), 487–491 (2019). https://doi.org/10.1038/s41586-019-1825-8
4. Cao, K., Bai, X., Hong, Y., Wan, L.: Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics 36(Suppl_1), i48–i56 (2020)
5. Cao, K., Hong, Y., Wan, L.: Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona. Bioinformatics 38(1), 211–219 (2021). https://doi.org/10.1093/bioinformatics/btab594
6. Chen, S., Lake, B.B., Zhang, K.: High-throughput sequencing of transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37(12), 1452–1457 (2019)
7. Cheow, L.F., Courtois, E.T., Tan, Y., Viswanathan, R., Xing, Q., Tan, R.Z., et al.: Single-cell multimodal profiling reveals cellular epigenetic heterogeneity. Nat. Methods 13(10), 833–836 (2016)
8. Clark, S.J., Argelaguet, R., Kapourani, C.A., Stubbs, T.M., Lee, H.J., et al.: scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9(1), 1–9 (2018)
9. Demetci, P., Santorella, R., Sandstede, B., Noble, W.S., Singh, R.: Gromov-Wasserstein optimal transport to align single-cell multi-omics data. bioRxiv (2020)
10. Dou, J., Liang, S., Mohanty, V., Cheng, X., Kim, S., Choi, J., et al.: Unbiased integration of single cell multi-omics data. bioRxiv (2020). https://doi.org/10.1101/2020.12.11.422014
11. Liero, M., Mielke, A., Savaré, G.: Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Invent. Math. 211(3), 969–1117 (2018)
12. Liu, J., Huang, Y., Singh, R., Vert, J.P., Noble, W.S.: Jointly embedding multiple single-cell omics measurements. bioRxiv, p. 644310 (2019)
13. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)
14. Singh, R., Demetci, P., Bonora, G., Ramani, V., Lee, C., Fang, H., et al.: Unsupervised manifold alignment for single-cell multi-omics data. In: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–10 (2020)
15. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W.M., III, et al.: Comprehensive integration of single-cell data. Cell 177(7), 1888–1902 (2019)
16. Séjourné, T., Vialard, F.X., Peyré, G.: The unbalanced Gromov-Wasserstein distance: Conic formulation and relaxation. arXiv (2021)
17. Welch, J.D., Hartemink, A.J., Prins, J.F.: MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol. 18(1), 138 (2017)

Author information

Authors and Affiliations

Center for Computational Molecular Biology, Brown University, Providence, RI, 02912, USA
Pınar Demetçi & Ritambhara Singh
Department of Computer Science, Brown University, Providence, RI, 02912, USA
Pınar Demetçi & Ritambhara Singh
Division of Applied Mathematics, Brown University, Providence, RI, 02912, USA
Rebecca Santorella & Björn Sandstede

Corresponding author

Correspondence to Ritambhara Singh.

Editor information

Editors and Affiliations

Columbia University, New York, NY, USA
Itsik Pe'er

Appendix

1.1 Embedding Method Details

The full details of t-SNE can be found in [13]. For each domain $m$, we compute $P^{m}$, an $n_m \times n_m$ cell-to-cell transition matrix. It is built from the conditional probabilities $P^{m}_{j|i}$ that a data point $x_i^m$ would pick $x_j^m$ as its neighbor when neighbors are chosen according to a Gaussian distribution centered at $x_i^m$:

$$\begin{aligned} P_{j|i}^m =\frac{\exp (- ||x_i^m - x_j^m||^2 / 2\sigma _i^2)}{\sum _{k \ne i} \exp (-||x_i^m - x_k^m||^2/2\sigma _i^2)}. \end{aligned}$$

(9)

The bandwidth $\sigma _i$ is chosen according to the density of the data points through a binary search for the value of $\sigma _i$ that achieves the user-supplied perplexity value. $P^m$ is computed by averaging $P^m_{i|j}$ and $P^m_{j|i}$ to give more weight to outlier points:

$$\begin{aligned} P^m_{ij} = \frac{P_{i|j}^m + P_{j|i}^m}{2 n_m} \end{aligned}$$

(10)
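As a rough illustration of Eqs. 9 and 10, the following NumPy sketch computes the conditional probabilities with a per-point binary search on the bandwidth and then symmetrizes them. The function names, the tolerance, and the choice to search over $1/(2\sigma_i^2)$ instead of $\sigma_i$ directly are assumptions made for illustration; this is not taken from the SCOTv2 code.

```python
import numpy as np


def conditional_probabilities(X, perplexity=5.0, tol=1e-5, max_iter=100):
    """Return an (n, n) matrix whose row i holds P_{j|i} from Eq. 9.

    X has one cell per row. The bandwidth of each row is found by a
    binary search so that the row's perplexity matches `perplexity`.
    """
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of cells.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    target_entropy = np.log(perplexity)  # perplexity = exp(entropy in nats)
    P = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i       # the sum in Eq. 9 excludes k = i
        d = sq_dists[i, others]
        d = d - d.min()                  # shift cancels in the ratio; avoids underflow
        lo, hi = 0.0, np.inf
        beta = 1.0                       # beta stands for 1 / (2 * sigma_i ** 2)
        for _ in range(max_iter):
            w = np.exp(-beta * d)
            p = w / w.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:
                # Distribution too flat: raise beta (shrink sigma_i).
                lo = beta
                beta = beta * 2.0 if hi == np.inf else (lo + hi) / 2.0
            else:
                # Distribution too peaked: lower beta (grow sigma_i).
                hi = beta
                beta = (lo + hi) / 2.0
        P[i, others] = p
    return P


def joint_probabilities(P_cond):
    """Symmetrize the conditional probabilities as in Eq. 10."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)


# Toy usage on random data (purely illustrative, not a real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
P_demo = joint_probabilities(conditional_probabilities(X_demo, perplexity=5.0))
```

Because each row of the conditional matrix sums to one, the symmetrized matrix sums to one over all pairs, which is what allows it to be compared with $Q^{m'}$ through a KL divergence.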

Then, to jointly embed all domains through the anchor domain $X^1$, the optimization problem is:

$$\begin{aligned} \min _{X^{1'}, \dots , X^{M'}} \sum _{m=1}^M\text {KL} (P^{m} || Q^{m'}) + \beta \sum _{m=2}^M ||X^{1'}-X^{m'} (\varGamma ^m)^T ||^2_F, \end{aligned}$$

(11)

where $X^{m'}$ is the lower-dimensional embedding of $X^m$, $P^m$ is defined through Eqs. 9 and 10, and $\varGamma ^m$ is the coupling matrix from solving Eq. 6 for $m=1,2, \dots , M$. The probability matrix $Q^{m'}$ is computed through a Student-t distribution with one degree of freedom:

$$\begin{aligned} Q^{m'}_{ij} = \frac{ (1 + ||x_i^{m'} - x_j^{m'}||^2)^{-1} }{ \sum _{k \ne l} (1 + || x_k^{m'} - x_l^{m'} ||^2)^{-1} }. \end{aligned}$$

(12)

The intuition behind the cost $\text {KL} (P^{m} || Q^{m'})$ is very similar to that of GW; if two points have a high transition probability in the original space, then they should also have a high transition probability in the latent space.
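To make the objective in Eq. 11 concrete, here is a minimal NumPy sketch that evaluates it for given embeddings. It uses the standard t-SNE Student-t kernel with squared Euclidean distances [13]. Several details are assumptions for illustration only: embeddings are stored with one cell per row (so the alignment term becomes $X^{1'} - \varGamma^m X^{m'}$), each coupling has shape $n_1 \times n_m$, and the function names are hypothetical. The sketch only evaluates the cost; it does not perform the minimization.

```python
import numpy as np


def student_t_affinities(Y):
    """Q_{ij} from a Student-t kernel with one degree of freedom (Eq. 12)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(kernel, 0.0)        # the normalizing sum runs over k != l
    return kernel / kernel.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over the entries where P is positive."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


def joint_embedding_cost(P_list, Y_list, couplings, beta):
    """Evaluate the objective of Eq. 11 for fixed embeddings.

    P_list[m]  : joint probability matrix of domain m, shape (n_m, n_m)
    Y_list[m]  : embedding of domain m, shape (n_m, p); Y_list[0] is the anchor
    couplings  : one matrix per non-anchor domain, assumed shape (n_0, n_m)
    beta       : weight of the alignment term
    """
    kl_term = sum(
        kl_divergence(P, student_t_affinities(Y))
        for P, Y in zip(P_list, Y_list)
    )
    anchor = Y_list[0]
    # Squared Frobenius norm between the anchor embedding and each
    # coupled (transported) non-anchor embedding.
    align_term = sum(
        np.sum((anchor - gamma @ Y) ** 2)
        for gamma, Y in zip(couplings, Y_list[1:])
    )
    return kl_term + beta * align_term


# Toy usage with two domains of equal size and an identity coupling.
rng = np.random.default_rng(1)
Y_demo = rng.normal(size=(6, 2))
P_toy = student_t_affinities(Y_demo)     # stand-in for a real P^m
cost_demo = joint_embedding_cost([P_toy, P_toy], [Y_demo, Y_demo], [np.eye(6)], beta=0.5)
```

With identical embeddings and an identity coupling, both the KL terms and the alignment term are close to zero. Moving one embedding away from the anchor raises only the alignment term, since the Student-t affinities depend on pairwise distances alone.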

1.2 Hyperparameter Tuning Procedure Details

For each alignment method, we define a grid of hyperparameters and choose the best-performing combination for each experiment. If methods share similar hyperparameters in their formulation, we keep the ranges defined for these consistent across all algorithms. We refer to the publication and the code repository of each method to choose hyperparameter ranges whenever possible.

For Pamona, we search the number of neighbors in the cell neighborhood graphs, $k \in \{20, 30, \dots , 150\}$; the entropic regularization coefficient, $\epsilon \in \{5e{-}4, 3e{-}4, 1e{-}4, 7e{-}3, 5e{-}3, \dots , 1e{-}2 \}$; the geometry preservation trade-off coefficient, $\lambda \in \{0.1, 0.5, 1, 5, 10\}$; and the embedding dimensionality, $p \in \{3, 4, 5, 10, 30, 32\}$. For UnionCom, we search the trade-off parameter, $\beta \in \{0.1, 1, 5, 10, 15, 20\}$; the regularization coefficient, $\rho \in \{0, 0.1, 1, 5, 10, 15, 20\}$; the maximum neighborhood size permitted in the neighborhood graphs, $k_{max} \in \{40, 100, 150\}$; and the embedding dimensionality, $p \in \{3, 4, 5, 10, 30, 32\}$. For MMD-MA, we tune the weights $\lambda _1, \lambda _2 \in \{1e{-}2, 5e{-}3, 1e{-}3, 5e{-}4, \dots , 1e{-}9\}$ and the embedding dimensionality, $p \in \{3,4,5,10,30,32\}$. For bindSC, we choose the coefficient that assigns weight to the initial gene activity matrix, $\alpha \in \{0, 0.1, 0.2, \dots , 0.9\}$; the coefficient that assigns a weight factor to the multi-objective function, $\lambda \in \{0.1, 0.2, \dots , 0.9\}$; and the number of canonical vectors for the embedding space, $K \in \{3, 4, 5, 10, 30, 32\}$. Lastly, for Seurat v4, we tune the number of neighbors to consider when finding anchors, $k \in \{5, 10, 15, 20\}$; the co-embedding dimensionality, $p \in \{3, 4, 5, 10, 30, 32\}$; and the choice of the reference and anchor domains when finding anchors.
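The grid search described above can be sketched with the Python standard library. The example uses the UnionCom grid because all of its values are listed explicitly. The helper names and the scoring function are hypothetical placeholders; in practice the score would come from running the alignment method and evaluating the resulting alignment, which is not shown here.

```python
from itertools import product

# UnionCom grid as listed in the text above.
UNIONCOM_GRID = {
    "beta": [0.1, 1, 5, 10, 15, 20],
    "rho": [0, 0.1, 1, 5, 10, 15, 20],
    "k_max": [40, 100, 150],
    "p": [3, 4, 5, 10, 30, 32],
}


def iter_grid(grid):
    """Yield every hyperparameter combination as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return (best_params, best_score), where a higher score is better."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder score for illustration only; NOT an alignment metric.

    A real score_fn would run the alignment with `params` and return
    an evaluation of the result.
    """
    return (
        -abs(params["beta"] - 5)
        - params["rho"]
        - abs(params["k_max"] - 100)
        - abs(params["p"] - 10)
    )


best_params, best_score = grid_search(UNIONCOM_GRID, toy_score)
```

An exhaustive search over this grid evaluates 6 × 7 × 3 × 6 = 756 combinations per experiment, so the cost of a single alignment run largely determines how fine a grid is practical.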

About this paper

Cite this paper

Demetçi, P., Santorella, R., Sandstede, B., Singh, R. (2022). Unsupervised Integration of Single-Cell Multi-omics Datasets with Disproportionate Cell-Type Representation. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_1

DOI: https://doi.org/10.1007/978-3-031-04749-7_1
Published: 29 April 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer Science, Computer Science (R0)
