Unsupervised Integration of Single-Cell Multi-omics Datasets with Disproportionate Cell-Type Representation

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2022)

Abstract

Integrated analysis of multi-omics data allows the study of how different molecular views in the genome interact to regulate cellular processes; however, with a few exceptions, applying multiple sequencing assays on the same single cell is not possible. While recent unsupervised algorithms align single-cell multi-omic datasets, these methods have been primarily benchmarked on co-assay experiments rather than the more common single-cell experiments taken from separately sampled cell populations. Therefore, most existing methods perform subpar alignments on such datasets. Here, we improve our previous work Single Cell alignment using Optimal Transport (SCOT) by using unbalanced optimal transport to handle disproportionate cell-type representation and differing sample sizes across single-cell measurements. We show that our proposed method, SCOTv2, consistently yields quality alignments on five real-world single-cell datasets with varying cell-type proportions and is computationally tractable. Additionally, we extend SCOTv2 to integrate multiple (\(M\ge 2\)) single-cell measurements and present a self-tuning heuristic process to select hyperparameters in the absence of any orthogonal correspondence information.

Available at: http://rsinghlab.github.io/SCOT.

Notes

  1. Preprocessed data for the scGEM dataset accessed here: https://github.com/jw156605/MATCHER.

  2. Dimensionality reduced data, used by Pamona and us, here: https://github.com/caokai1073/Pamona/tree/master/scNMT. Preprocessing scripts for the raw data provided by the authors here: https://github.com/PMBio/scNMT-seq/.

References

  1. Bonora, G., et al.: Single-cell landscape of nuclear configuration and gene expression during stem cell differentiation and X inactivation. Genome Biol. 22(1), 279 (2021). https://doi.org/10.1186/s13059-021-02432-w

  2. Alvarez-Melis, D., Jaakkola, T.S.: Gromov-Wasserstein alignment of word embedding spaces. arXiv preprint arXiv:1809.00013 (2018)

  3. Argelaguet, R., Clark, S.J., Mohammed, H., Stapel, L.C., Krueger, C., Kapourani, C.A., et al.: Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 576(7787), 487–491 (2019). https://doi.org/10.1038/s41586-019-1825-8

  4. Cao, K., Bai, X., Hong, Y., Wan, L.: Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics 36(Suppl. 1), i48–i56 (2020)

  5. Cao, K., Hong, Y., Wan, L.: Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona. Bioinformatics 38(1), 211–219 (2021). https://doi.org/10.1093/bioinformatics/btab594

  6. Chen, S., Lake, B.B., Zhang, K.: High-throughput sequencing of transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37(12), 1452–1457 (2019)

  7. Cheow, L.F., Courtois, E.T., Tan, Y., Viswanathan, R., Xing, Q., Tan, R.Z., et al.: Single-cell multimodal profiling reveals cellular epigenetic heterogeneity. Nat. Methods 13(10), 833–836 (2016)

  8. Clark, S.J., Argelaguet, R., Kapourani, C.A., Stubbs, T.M., Lee, H.J., et al.: scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9(1), 1–9 (2018)

  9. Demetci, P., Santorella, R., Sandstede, B., Noble, W.S., Singh, R.: Gromov-Wasserstein optimal transport to align single-cell multi-omics data. bioRxiv (2020)

  10. Dou, J., Liang, S., Mohanty, V., Cheng, X., Kim, S., Choi, J., et al.: Unbiased integration of single cell multi-omics data. bioRxiv (2020). https://doi.org/10.1101/2020.12.11.422014. https://www.biorxiv.org/content/early/2020/12/11/2020.12.11.422014

  11. Liero, M., Mielke, A., Savaré, G.: Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Invent. Math. 211(3), 969–1117 (2018)

  12. Liu, J., Huang, Y., Singh, R., Vert, J.P., Noble, W.S.: Jointly embedding multiple single-cell omics measurements. bioRxiv, p. 644310 (2019)

  13. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)

  14. Singh, R., Demetci, P., Bonora, G., Ramani, V., Lee, C., Fang, H., et al.: Unsupervised manifold alignment for single-cell multi-omics data. In: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–10 (2020)

  15. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W.M., III, et al.: Comprehensive integration of single-cell data. Cell 177(7), 1888–1902 (2019)

  16. Séjourné, T., Vialard, F.X., Peyré, G.: The unbalanced Gromov Wasserstein distance: conic formulation and relaxation. arXiv (2021)

  17. Welch, J.D., Hartemink, A.J., Prins, J.F.: MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol. 18(1), 138 (2017)

Author information

Corresponding author

Correspondence to Ritambhara Singh.

Appendix

1.1 Embedding Method Details

The full details of t-SNE can be found in [13]. For each domain \(m\), we compute \(P^{m}\), an \(n_m \times n_m\) cell-to-cell transition matrix; each entry \(P^{m}_{j|i}\) is the conditional probability that a data point \(x_i^m\) would pick \(x_j^m\) as its neighbor when neighbors are chosen according to a Gaussian distribution centered at \(x_i^m\):

$$\begin{aligned} P_{j|i}^m =\frac{\exp (- ||x_i^m - x_j^m||^2 / 2\sigma _i^2)}{\sum _{k \ne i} \exp (-||x_i^m - x_k^m||^2/2\sigma _i^2)}. \end{aligned}$$
(9)

The bandwidth \(\sigma _i\) is chosen according to the density of the data points through a binary search for the value of \(\sigma _i\) that achieves the user-supplied perplexity value. \(P^m\) is computed by averaging \(P^m_{i|j}\) and \(P^m_{j|i}\) to give more weight to outlier points:

$$\begin{aligned} P^m_{ij} = \frac{P_{i|j}^m + P_{j|i}^m}{2 n_m} \end{aligned}$$
(10)
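As a rough illustration of Eqs. 9 and 10, the following NumPy sketch computes the conditional probabilities with a per-point binary search and then symmetrizes them. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the tolerance, the iteration cap, and the choice to search over the precision \(1/(2\sigma _i^2)\) rather than \(\sigma _i\) itself are all illustrative, and perplexity is taken to be the exponential of the Shannon entropy in nats.

```python
import numpy as np


def conditional_probabilities(X, perplexity=30.0, tol=1e-5, max_iter=100):
    """Row i holds P_{j|i} from Eq. 9 for the cells in X (shape n x d).

    Hypothetical helper: searches the precision beta_i = 1 / (2 sigma_i^2)
    so that each row's perplexity matches the requested value.
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances between all pairs of cells.
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    target_entropy = np.log(perplexity)
    P = np.zeros((n, n))
    for i in range(n):
        d = np.delete(D[i], i)          # distances to every k != i
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            w = np.exp(-(d - d.min()) * beta)   # shift only for stability
            p = w / w.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:
                # Distribution too flat: raise beta (shrink sigma_i).
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else 0.5 * (lo + hi)
            else:
                # Distribution too peaked: lower beta (grow sigma_i).
                hi = beta
                beta = 0.5 * (lo + hi)
        P[i, np.arange(n) != i] = p
    return P


def joint_probabilities(P_cond):
    """Eq. 10: average P_{i|j} and P_{j|i}, then divide by n_m."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```

Each row of the conditional matrix sums to one, so the symmetrized matrix sums to one overall and can be treated as a joint distribution over cell pairs.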

Then, to jointly embed all domains through the anchor domain \(X^1\), the optimization problem is:

$$\begin{aligned} \min _{X^{1'}, \dots , X^{M'}} \sum _{m=1}^M\text {KL} (P^{m} || Q^{m'}) + \beta \sum _{m=2}^M ||X^{1'}-X^{m'} (\varGamma ^m)^T ||^2_F, \end{aligned}$$
(11)

where \(X^{m'}\) is the lower dimensional embedding of \(X^m\), \(P^m\) is defined as in Eq. 10, and \(\varGamma ^m\) is the coupling matrix from solving Eq. 6 for \(m=1,2, \dots , M\). The probability matrix \(Q^{m'}\) is computed through a Student-t distribution with one degree of freedom:

$$\begin{aligned} Q^{m'}_{ij} = \frac{ (1 + ||x_i^{m'} - x_j^{m'}||^2)^{-1} }{ \sum _{k \ne l} (1 + || x_k^{m'} - x_l^{m'} ||^2 )^{-1} }. \end{aligned}$$
(12)

The intuition behind the cost \(\text {KL} (P^{m} || Q^{m'})\) is very similar to that of GW; if two points have a high transition probability in the original space, then they should also have a high transition probability in the latent space.
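To make the objective in Eq. 11 concrete, the sketch below evaluates it for a given set of embeddings. It uses the standard t-SNE Student-t kernel on squared distances from [13] for \(Q^{m'}\). Several things here are assumptions for illustration only: the function names, storing cells as rows (so the anchor term is written as \(X^{1'} - \varGamma ^m X^{m'}\) with \(\varGamma ^m\) of shape \(n_1 \times n_m\)), and using the coupling matrices exactly as passed in, without any row normalization. The sketch only evaluates the loss; it does not perform the minimization.

```python
import numpy as np


def pairwise_sq_dists(Y):
    """Squared Euclidean distances between the rows of Y."""
    sq = np.sum(Y ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T), 0.0)


def student_t_affinities(Y):
    """Q_ij: Student-t kernel with one degree of freedom, normalized over k != l."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)   # pairs with k == l are excluded from the sum
    return W / W.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over entries where P is positive."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


def joint_embedding_loss(P_list, Y_list, couplings, beta=1.0):
    """Evaluate Eq. 11 (illustrative layout, cells stored as rows).

    P_list[m]    : joint probability matrix of domain m, shape (n_m, n_m)
    Y_list[m]    : embedding of domain m, shape (n_m, p); index 0 is the anchor
    couplings[j] : coupling between the anchor and domain j + 1, shape (n_1, n_m)
    """
    kl_total = sum(
        kl_divergence(P, student_t_affinities(Y))
        for P, Y in zip(P_list, Y_list)
    )
    penalty = sum(
        np.linalg.norm(Y_list[0] - G @ Y_m, ord="fro") ** 2
        for G, Y_m in zip(couplings, Y_list[1:])
    )
    return float(kl_total + beta * penalty)
```

Because the Student-t affinities depend only on pairwise distances, translating one embedding leaves its KL term unchanged and affects the loss only through the anchor penalty, which is what pulls the non-anchor domains toward \(X^{1'}\).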

1.2 Hyperparameter Tuning Procedure Details

For each alignment method, we define a grid of hyperparameters and choose the best performing combination for each experiment. If methods share similar hyperparameters in their formulation, we keep the ranges defined for these consistent across all algorithms. We refer to the publication and the code repository for each method to choose hyperparameter ranges whenever possible.

For Pamona, we search the number of neighbors in the cell neighborhood graphs, \(k \in \{20, 30, \dots , 150\}\); the entropic regularization coefficient, \(\epsilon \in \{5e{-}4, 3e{-}4, 1e{-}4, 7e{-}3, 5e{-}3, \dots , 1e{-}2 \}\); the geometry preservation trade-off coefficient, \(\lambda \in \{0.1, 0.5, 1, 5, 10\}\); and the output embedding dimensionality, \(p \in \{3, 4, 5, 10, 30, 32\}\). For UnionCom, we search the trade-off parameter, \(\beta \in \{0.1, 1, 5, 10, 15, 20\}\); the regularization coefficient, \(\rho \in \{0, 0.1, 1, 5, 10, 15, 20\}\); the maximum neighborhood size permitted in the neighborhood graphs, \(k_{max} \in \{40, 100, 150\}\); and the embedding dimensionality, \(p \in \{3, 4, 5, 10, 30, 32\}\). For MMD-MA, we tune the weights \(\lambda _1\) and \(\lambda _2\) \(\in \{1e{-}2, 5e{-}3, 1e{-}3, 5e{-}4, \dots , 1e{-}9\}\), and the embedding dimensionality, \(p \in \{3,4,5,10,30,32\}\). For bindSC, we choose the coefficient that weights the initial gene activity matrix, \(\alpha \in \{0, 0.1, 0.2, \dots , 0.9\}\); the coefficient that weights the multi-objective function, \(\lambda \in \{0.1, 0.2, \dots , 0.9\}\); and the number of canonical vectors for the embedding space, \(K \in \{3, 4, 5, 10, 30, 32\}\). Lastly, for Seurat v4, we tune the number of neighbors to consider when finding anchors, \(k \in \{5, 10, 15, 20\}\); the co-embedding dimensionality, \(p \in \{3, 4, 5, 10, 30, 32\}\); and the choice of the reference and anchor domains when finding anchors.
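The grid search described above can be sketched in a few lines of Python. The grid below copies the UnionCom ranges from the text; everything else is an assumption for illustration. In particular, `grid_search`, `toy_evaluate`, and the dictionary keys are hypothetical names, and the toy scoring function is only a stand-in for running an alignment method and measuring its quality with whatever metric an experiment uses.

```python
from itertools import product

# UnionCom ranges as listed in the text; the key names are illustrative.
unioncom_grid = {
    "beta": [0.1, 1, 5, 10, 15, 20],
    "rho": [0, 0.1, 1, 5, 10, 15, 20],
    "kmax": [40, 100, 150],
    "p": [3, 4, 5, 10, 30, 32],
}


def grid_search(grid, evaluate):
    """Try every combination in the grid and keep the highest-scoring one.

    `evaluate` is a caller-supplied function mapping a parameter dict to a
    score, where higher is assumed to be better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Hypothetical stand-in for "run the method, then score the alignment".
    return (
        -abs(params["beta"] - 5)
        - abs(params["p"] - 10)
        - params["rho"]
        - params["kmax"] / 1000
    )


best_params, best_score = grid_search(unioncom_grid, toy_evaluate)
```

An exhaustive search like this evaluates 6 × 7 × 3 × 6 = 756 combinations for the UnionCom grid alone, which is why the per-method ranges are kept fairly coarse.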

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Demetçi, P., Santorella, R., Sandstede, B., Singh, R. (2022). Unsupervised Integration of Single-Cell Multi-omics Datasets with Disproportionate Cell-Type Representation. In: Pe'er, I. (ed.) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_1

  • DOI: https://doi.org/10.1007/978-3-031-04749-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04748-0

  • Online ISBN: 978-3-031-04749-7

  • eBook Packages: Computer Science, Computer Science (R0)
