Abstract
Integrated analysis of multi-omics data allows the study of how different molecular views in the genome interact to regulate cellular processes; however, with a few exceptions, applying multiple sequencing assays on the same single cell is not possible. While recent unsupervised algorithms align single-cell multi-omic datasets, these methods have been primarily benchmarked on co-assay experiments rather than the more common single-cell experiments taken from separately sampled cell populations. Therefore, most existing methods perform subpar alignments on such datasets. Here, we improve our previous work Single Cell alignment using Optimal Transport (SCOT) by using unbalanced optimal transport to handle disproportionate cell-type representation and differing sample sizes across single-cell measurements. We show that our proposed method, SCOTv2, consistently yields quality alignments on five real-world single-cell datasets with varying cell-type proportions and is computationally tractable. Additionally, we extend SCOTv2 to integrate multiple (\(M\ge 2\)) single-cell measurements and present a self-tuning heuristic process to select hyperparameters in the absence of any orthogonal correspondence information.
Available at: http://rsinghlab.github.io/SCOT.
Notes
1. Preprocessed data for the scGEM dataset accessed here: https://github.com/jw156605/MATCHER.
2. Dimensionality-reduced data, used by Pamona and us, here: https://github.com/caokai1073/Pamona/tree/master/scNMT. Preprocessing scripts for the raw data provided by the authors here: https://github.com/PMBio/scNMT-seq/.
References
Bonora, G., et al.: Single-cell landscape of nuclear configuration and gene expression during stem cell differentiation and X inactivation. Genome Biol. 22(1), 279 (2021). https://doi.org/10.1186/s13059-021-02432-w
Alvarez-Melis, D., Jaakkola, T.S.: Gromov-Wasserstein alignment of word embedding spaces. arXiv preprint arXiv:1809.00013 (2018)
Argelaguet, R., Clark, S.J., Mohammed, H., Stapel, L.C., Krueger, C., Kapourani, C.A., et al.: Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 576(7787), 487–491 (2019). https://doi.org/10.1038/s41586-019-1825-8
Cao, K., Bai, X., Hong, Y., Wan, L.: Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics 36(Suppl._1), i48–i56 (2020)
Cao, K., Hong, Y., Wan, L.: Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona. Bioinformatics 38(1), 211–219 (2021). https://doi.org/10.1093/bioinformatics/btab594
Chen, S., Lake, B.B., Zhang, K.: High-throughput sequencing of transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37(12), 1452–1457 (2019)
Cheow, L.F., Courtois, E.T., Tan, Y., Viswanathan, R., Xing, Q., Tan, R.Z., et al.: Single-cell multimodal profiling reveals cellular epigenetic heterogeneity. Nat. Methods 13(10), 833–836 (2016)
Clark, S.J., Argelaguet, R., Kapourani, C.A., Stubbs, T.M., Lee, H.J., et al.: scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9(1), 1–9 (2018)
Demetci, P., Santorella, R., Sandstede, B., Noble, W.S., Singh, R.: Gromov-Wasserstein optimal transport to align single-cell multi-omics data. bioRxiv (2020)
Dou, J., Liang, S., Mohanty, V., Cheng, X., Kim, S., Choi, J., et al.: Unbiased integration of single cell multi-omics data. bioRxiv (2020). https://doi.org/10.1101/2020.12.11.422014. https://www.biorxiv.org/content/early/2020/12/11/2020.12.11.422014
Liero, M., Mielke, A., Savaré, G.: Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Invent. Math. 211(3), 969–1117 (2018)
Liu, J., Huang, Y., Singh, R., Vert, J.P., Noble, W.S.: Jointly embedding multiple single-cell omics measurements. bioRxiv, p. 644310 (2019)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)
Singh, R., Demetci, P., Bonora, G., Ramani, V., Lee, C., Fang, H., et al.: Unsupervised manifold alignment for single-cell multi-omics data. In: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–10 (2020)
Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W.M., et al.: Comprehensive integration of single-cell data. Cell 177(7), 1888–1902 (2019)
Séjourné, T., Vialard, F.X., Peyré, G.: The unbalanced Gromov Wasserstein distance: Conic formulation and relaxation. arXiv (2021)
Welch, J.D., Hartemink, A.J., Prins, J.F.: MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol. 18(1), 138 (2017)
Appendix
1.1 Embedding Method Details
The full details of t-SNE can be found in [13]. For each domain m, we compute \(P^{m}\), an \(n_m \times n_m\) cell-to-cell transition matrix; each entry \(P^{m}_{j|i}\) is the conditional probability that a data point \(x_i^m\) would pick \(x_j^m\) as its neighbor when neighbors are chosen according to a Gaussian distribution centered at \(x_i^m\):
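Assuming the standard t-SNE definition from [13], which matches the description above, this conditional probability can be written as follows. The summation index \(k\) and the convention \(P^{m}_{i|i} = 0\) are taken from [13] rather than from this text.

```latex
\[
P^{m}_{j|i} =
\frac{\exp\left(-\lVert x_i^m - x_j^m \rVert^2 / 2\sigma_i^2\right)}
     {\sum_{k \ne i} \exp\left(-\lVert x_i^m - x_k^m \rVert^2 / 2\sigma_i^2\right)},
\qquad P^{m}_{i|i} = 0 .
\]
```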
The bandwidth \(\sigma _i\) is chosen according to the density of the data points through a binary search for the value of \(\sigma _i\) that achieves the user-supplied perplexity value. \(P^m\) is computed by averaging \(P^m_{i|j}\) and \(P^m_{j|i}\) to give more weight to outlier points:
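The bandwidth search and the averaging step can be sketched in a few lines of NumPy. This is a minimal illustration under the standard t-SNE conventions of [13], not the SCOTv2 implementation: the function names, the tolerance, and the normalization \(P^m_{ij} = (P^m_{j|i} + P^m_{i|j}) / (2 n_m)\) are assumptions.

```python
import numpy as np

def conditional_probabilities(X, perplexity=30.0, tol=1e-5, max_iter=100):
    """Row i holds P_{j|i}: Gaussian affinities with a per-point bandwidth.

    Hypothetical helper; searches beta_i = 1 / (2 * sigma_i^2) by bisection
    so that each row reaches the requested perplexity.
    """
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of cells.
    sq_norms = np.sum(X ** 2, axis=1)
    D = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    P = np.zeros((n, n))
    target_entropy = np.log(perplexity)  # perplexity = exp(entropy in nats)
    for i in range(n):
        d = np.delete(D[i], i)  # distances to every other point
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            w = np.exp(-(d - d.min()) * beta)  # shift for numerical stability
            p = w / w.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:
                # Distribution too flat: shrink the bandwidth (raise beta).
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
            else:
                # Distribution too peaked: widen the bandwidth (lower beta).
                hi = beta
                beta = (lo + hi) / 2.0
        P[i, np.arange(n) != i] = p
    return P

def symmetrize(P_cond):
    """Average P_{j|i} and P_{i|j}; the 2n factor makes all entries sum to 1."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```

A point in a sparse region receives a large \(\sigma_i\), so after averaging it still holds a non-negligible share of probability mass even if no other point picks it as a neighbor.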
Then, to jointly embed all domains through the anchor domain \(X^1\), the optimization problem is:
where \(X^{m'}\) is the lower-dimensional embedding of \(X^m\), \(P^m\) is defined as in Eq. 9, and \(\varGamma ^m\) is the coupling matrix from solving Eq. 6 for \(m=1,2, \dots , M\). The probability matrix \(Q^m\) is computed from \(X^{m'}\) through a Student-t distribution with one degree of freedom:
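Assuming the standard t-SNE form from [13], the Student-t probabilities over the embedded points can be written as follows. Here \(x_i^{m'}\) denotes row \(i\) of \(X^{m'}\); this notation and the indices \(k, l\) are ours.

```latex
\[
Q^{m}_{ij} =
\frac{\left(1 + \lVert x_i^{m'} - x_j^{m'} \rVert^2\right)^{-1}}
     {\sum_{k \ne l} \left(1 + \lVert x_k^{m'} - x_l^{m'} \rVert^2\right)^{-1}},
\qquad Q^{m}_{ii} = 0 .
\]
```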
The intuition behind the cost \(\text {KL} (P^{m} || Q^{m'})\) is very similar to that of GW; if two points have a high transition probability in the original space, then they should also have a high transition probability in the latent space.
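To make the KL cost concrete, the sketch below computes a Student-t affinity matrix from an embedding and evaluates \(\text{KL}(P \,\|\, Q)\) against a given probability matrix. It assumes the standard t-SNE definitions from [13]; the function names and the `eps` guard are hypothetical, and the coupling term of the joint objective is not included.

```python
import numpy as np

def student_t_affinities(Y):
    """Student-t (one degree of freedom) affinities over embedded points Y."""
    sq = np.sum(Y ** 2, axis=1)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    W = 1.0 / (1.0 + D)       # heavy-tailed kernel
    np.fill_diagonal(W, 0.0)  # a point is not its own neighbor
    return W / W.sum()        # normalize over all pairs

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over entries where P is positive."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))
```

The divergence is close to zero when the embedding reproduces the neighbor structure encoded in \(P\), and it grows when pairs with high transition probability in the original space are placed far apart in the latent space.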
1.2 Hyperparameter Tuning Procedure Details
For each alignment method, we define a grid of hyperparameters and choose the best-performing combination for each experiment. If methods share similar hyperparameters in their formulation, we keep the ranges defined for these consistent across all algorithms. We refer to the publication and the code repository for each method to choose hyperparameter ranges whenever possible.
For Pamona, we search over the number of neighbors in the cell neighborhood graphs, \(k \in \{20, 30, \dots , 150\}\), the entropic regularization coefficient, \(\epsilon \in \{5e{-}4, 3e{-}4, 1e{-}4, 7e{-}3, 5e{-}3, \dots , 1e{-}2 \}\), the geometry preservation trade-off coefficient, \(\lambda \in \{0.1, 0.5, 1, 5, 10\}\), and the embedding (output) dimensionality, \(p \in \{3, 4, 5, 10, 30, 32\}\).
For UnionCom, we search over the trade-off parameter \(\beta \in \{0.1, 1, 5, 10, 15, 20\}\), the regularization coefficient \(\rho \in \{0, 0.1, 1, 5, 10, 15, 20\}\), the maximum neighborhood size permitted in the neighborhood graphs, \(k_{max} \in \{40, 100, 150\}\), and the embedding dimensionality, \(p \in \{3, 4, 5, 10, 30, 32\}\).
For MMD-MA, we tune the weights \(\lambda _1\) and \(\lambda _2\) \(\in \{1e{-}2, 5e{-}3, 1e{-}3, 5e{-}4, \dots , 1e{-}9\}\), and the embedding dimensionality, \(p \in \{3,4,5,10,30,32\}\).
For bindSC, we tune the coefficient that assigns weight to the initial gene activity matrix, \(\alpha \in \{0, 0.1, 0.2, \dots , 0.9\}\), the coefficient that weights the terms of the multi-objective function, \(\lambda \in \{0.1, 0.2, \dots , 0.9\}\), and the number of canonical vectors for the embedding space, \(K \in \{3, 4, 5, 10, 30, 32\}\).
Lastly, for Seurat v4, we tune the number of neighbors to consider when finding anchors, \(k \in \{5, 10, 15, 20\}\), the co-embedding dimensionality, \(p \in \{3, 4, 5, 10, 30, 32\}\), and the choice of the reference and anchor domains when finding anchors.
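The grid search described above amounts to an exhaustive loop over the Cartesian product of the ranges. The sketch below uses the UnionCom grid, which is fully enumerated in the text. The helper `grid_search`, the key names, and `toy_score` are hypothetical: in practice the score function would run the aligner and return an alignment error, and we assume here that lower is better.

```python
from itertools import product

# UnionCom ranges as listed in the text (dictionary keys are our naming).
UNIONCOM_GRID = {
    "beta": [0.1, 1, 5, 10, 15, 20],
    "rho": [0, 0.1, 1, 5, 10, 15, 20],
    "kmax": [40, 100, 150],
    "p": [3, 4, 5, 10, 30, 32],
}

def grid_search(grid, score_fn):
    """Evaluate every combination and keep the lowest-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(beta, rho, kmax, p):
    # Placeholder objective for illustration only; a real run would align
    # the datasets with these settings and return an error metric.
    return abs(beta - 5) + abs(rho - 1) + abs(kmax - 100) / 100 + abs(p - 5)

best_params, best_score = grid_search(UNIONCOM_GRID, toy_score)
```

This grid contains 6 × 7 × 3 × 6 = 756 combinations, so the cost of tuning scales with the running time of a single alignment.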
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Demetçi, P., Santorella, R., Sandstede, B., Singh, R. (2022). Unsupervised Integration of Single-Cell Multi-omics Datasets with Disproportionate Cell-Type Representation. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science(), vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_1
DOI: https://doi.org/10.1007/978-3-031-04749-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer Science, Computer Science (R0)