Abstract
Dimension reduction is a common problem when analysing large data sets. The present paper proposes a method called reduced multidimensional scaling, which is based on performing an initial standard multidimensional scaling on a reduced data set. This method faces the problem of finding a representative reduced sample. An algorithm is presented to perform this selection by alternating between sampling observations in outlier areas and sampling observations in high-density areas. A space is then constructed from the selected reduced sample by standard multidimensional scaling using pairwise distances. The observations not included in the reduced sample are then projected onto the constructed space using Gower’s formula in order to obtain a final representation of the whole data set. The only requirement is the ability to compute distances among observations. A simulation study showed that the proposed algorithm performs well at detecting outliers. Evaluation of running times suggests that the proposed method could run in a few hours with data sets that would take more than one year to analyse with standard multidimensional scaling. An application is presented with a data set of 9547 DNA sequences of human immunodeficiency viruses.
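The two numerical steps named in the abstract (standard multidimensional scaling on a reduced sample, then projection of a left-out observation with Gower's formula) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, the reduced sample is chosen by hand rather than by the selection algorithm, a plain power iteration stands in for a proper eigensolver, and the distances are assumed to be Euclidean so that the leading eigenvalues are positive and well separated.

```python
import math

def double_center(d2):
    """Return B = -1/2 * J * D2 * J from a matrix of squared distances."""
    n = len(d2)
    row_mean = [sum(r) / n for r in d2]
    grand = sum(row_mean) / n
    return [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]

def top_eigenpairs(b, k, iters=200):
    """Leading k eigenpairs by power iteration with deflation (illustrative only)."""
    n = len(b)
    b = [r[:] for r in b]  # work on a copy
    pairs = []
    for c in range(k):
        v = [math.sin(i + 1 + c) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return pairs

def classical_mds(d2, k):
    """Standard (classical) MDS on the reduced sample; assumes positive eigenvalues."""
    b = double_center(d2)
    pairs = top_eigenpairs(b, k)
    coords = [[math.sqrt(lam) * v[i] for lam, v in pairs] for i in range(len(d2))]
    eigvals = [lam for lam, _ in pairs]
    b_diag = [b[i][i] for i in range(len(d2))]
    return coords, eigvals, b_diag

def gower_project(d2_new, coords, eigvals, b_diag):
    """Gower's formula: y = 1/2 * Lambda^-1 * X^T * (diag(B) - d2_new)."""
    n = len(coords)
    return [sum(coords[i][c] * (b_diag[i] - d2_new[i]) for i in range(n))
            / (2.0 * eigvals[c]) for c in range(len(eigvals))]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Toy data (made up for illustration): five reference points, one left-out point.
reference = [(0, 0), (6, 0), (0, 2), (6, 2), (3, 1)]
left_out = (1, 3)
d2_ref = [[sq_dist(p, q) for q in reference] for p in reference]
coords, eigvals, b_diag = classical_mds(d2_ref, k=2)
y = gower_project([sq_dist(left_out, p) for p in reference], coords, eigvals, b_diag)
```

Only the distances from the left-out point to the reference points enter the projection, so no distance between two left-out observations is ever computed. In this toy case the data are exactly two-dimensional, so the projected point reproduces its original distances to the reference points; with real data the reconstruction is only approximate.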

References
Abraham G, Inouye M (2014) Fast principal component analysis of large-scale genome-wide data. PLoS ONE 9(4):e93766. https://doi.org/10.1371/journal.pone.0093766
Baglama J, Reichel L (2005) Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J Sci Comput 27(1):19–42
Baglama J, Reichel L, Lewis BW (2019) irlba: fast truncated singular value decomposition and principal components analysis for large dense and sparse matrices. https://CRAN.R-project.org/package=irlba, R package version 2.3.3
Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, Ginhoux F, Newell EW (2019) Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol 37:38–44. https://doi.org/10.1038/nbt.4314
Beugin MP, Gayet T, Pontier D, Devillard S, Jombart T (2018) A fast likelihood solution to the genetic clustering problem. Methods Ecol Evol 9(4):1006–1016. https://doi.org/10.1111/2041-210X.12968
Degras D, Cardot H (2016) Online principal component analysis. https://CRAN.R-project.org/package=onlinePCA, R package version 1.3.1
D’Enza AI, Markos A, Buttarazzi D (2018) The idm package: incremental decomposition methods in R. J Stat Softw Code Snippets 86(4):1–24. https://doi.org/10.18637/jss.v086.c04
Erichson NB, Voronin S, Brunton SL, Kutz JN (2019) Randomized matrix decompositions using R. J Stat Softw 89(11):1–48. https://doi.org/10.18637/jss.v089.i11
Franch G, Jurman G, Coviello L, Pendesini M, Furlanello C (2019) MASS-UMAP: fast and accurate analog ensemble search in weather radar archives. Remote Sens 11(24):2922. https://doi.org/10.3390/rs11242922
Gower JC (1968) Adding a point to vector diagrams in multivariate analysis. Biometrika 55(3):582–585
Halko N, Martinsson PG, Tropp JA (2011) Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev 53(2):217–288. https://doi.org/10.1137/090771806
Kruskal JB (1964) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1):1–27
Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137. https://doi.org/10.1109/TIT.1982.1056489
McInnes L, Healy J, Saul N, Großberger L (2018) UMAP: uniform manifold approximation and projection. J Open Sour Softw 3:861. https://doi.org/10.21105/joss.00861
Mirarab S, Nguyen N, Guo S, Wang LS, Kim J, Warnow T (2015) PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J Comput Biol 22(5):377–386. https://doi.org/10.1089/cmb.2014.0156
Paradis E (2018) Multidimensional scaling with very large data sets. J Comput Gr Stat 27(4):935–939. https://doi.org/10.1080/10618600.2018.1470001
Paradis E (2020) Population genomics with R. Chapman & Hall, Boca Raton, FL
Paradis E, Schliep K (2019) ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35(3):526–528. https://doi.org/10.1093/bioinformatics/bty633
Qiu Y, Mei J (2019) RSpectra: solvers for large-scale eigenvalue and SVD problems. https://CRAN.R-project.org/package=RSpectra, R package version 0.16-0
R Core Team (2021) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Roweis S (1998) EM algorithms for PCA and SPCA. In: Neural Information Processing Systems 10 (NIPS’97), pp 626–632
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE (2015) Big data: astronomical or genomical? PLoS Biol 13(7):e1002195
Sun S, Zhu J, Ma Y, Zhou X (2019) Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol 20:269. https://doi.org/10.1186/s13059-019-1898-6
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York
Wan S, Kim J, Won KJ (2020) SHARP: hyperfast and accurate processing of single-cell RNA-seq via ensemble random projection. Genome Res 30:205–213. https://doi.org/10.1101/gr.254557.119
Acknowledgements
I am grateful to two anonymous reviewers for their constructive comments on a previous version of this article. This is publication ISEM 2021-118.
Cite this article
Paradis, E. Reduced multidimensional scaling. Comput Stat 37, 91–105 (2022). https://doi.org/10.1007/s00180-021-01116-0
DOI: https://doi.org/10.1007/s00180-021-01116-0