Context-Based Evaluation of Dimensionality Reduction Algorithms—Experiments and Statistical Significance Analysis

Published: 04 January 2021

Abstract

Dimensionality reduction is a commonly used technique in data analytics. Reducing the dimensionality of a dataset helps not only to manage its analytical complexity but also to remove redundancy. Over the years, several such algorithms have been proposed, with aims ranging from generating simple linear projections to complex non-linear transformations of the input data. Researchers have subsequently defined several quality metrics to evaluate the performance of these algorithms. Hence, given the plethora of dimensionality reduction algorithms and metrics for their quality analysis, there is a long-standing need for guidelines on how to select the most appropriate algorithm in a given scenario. To bridge this gap, in this article, we compiled 12 state-of-the-art quality metrics and categorized them into 5 identified analytical contexts. Furthermore, we assessed 15 of the most popular dimensionality reduction algorithms on the chosen quality metrics in a large-scale and systematic experimental study. Then, using a set of robust non-parametric statistical tests, we assessed the generalizability of our evaluation across 40 real-world datasets. Finally, based on our results, we present practitioners’ guidelines for selecting an appropriate dimensionality reduction algorithm in each of the identified analytical contexts.
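To make the methodology concrete, below is a minimal sketch (not the authors’ code) of such a context-based evaluation: several dimensionality reduction algorithms are applied to several datasets, each embedding is scored with one quality metric, and the algorithms are then compared with a non-parametric Friedman test. It assumes scikit-learn and SciPy; PCA, Isomap, MDS, and the trustworthiness metric are illustrative stand-ins for the paper’s 15 algorithms and 12 metrics.

    # Minimal sketch of the evaluation methodology described above (not the
    # authors' pipeline): embed several datasets with several dimensionality
    # reduction algorithms, score each embedding with one quality metric, and
    # compare the algorithms across datasets with a non-parametric Friedman test.
    # Assumes scikit-learn and SciPy; the algorithm and metric choices here are
    # illustrative stand-ins for the paper's 15 algorithms and 12 metrics.
    from scipy.stats import friedmanchisquare
    from sklearn.datasets import load_digits, load_iris, load_wine
    from sklearn.decomposition import PCA
    from sklearn.manifold import MDS, Isomap, trustworthiness
    from sklearn.preprocessing import StandardScaler

    datasets = {
        "digits": load_digits(return_X_y=True)[0][:300],  # subsample for speed
        "iris": load_iris(return_X_y=True)[0],
        "wine": load_wine(return_X_y=True)[0],
    }
    algorithms = {
        "PCA": lambda X: PCA(n_components=2).fit_transform(X),
        "Isomap": lambda X: Isomap(n_components=2).fit_transform(X),
        "MDS": lambda X: MDS(n_components=2, random_state=0).fit_transform(X),
    }

    # scores[algo] holds one quality value per dataset (higher is better).
    scores = {name: [] for name in algorithms}
    for X in datasets.values():
        X = StandardScaler().fit_transform(X)
        for name, embed in algorithms.items():
            Y = embed(X)
            # Trustworthiness in [0, 1]: a rank-based measure of how well
            # neighbourhoods in the embedding match those in the original space.
            scores[name].append(trustworthiness(X, Y, n_neighbors=5))

    # Friedman test: do the algorithms' per-dataset rankings differ significantly?
    stat, p = friedmanchisquare(*scores.values())
    print({k: [round(s, 3) for s in v] for k, v in scores.items()})
    print(f"Friedman chi-square = {stat:.3f}, p = {p:.3f}")

A rejected Friedman null hypothesis only indicates that the algorithms’ rankings differ somewhere; pairwise post-hoc tests are the usual follow-up to identify which algorithms differ.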

Supplementary Material

a24-ghosh-suppl.pdf (ghosh.zip)
Supplemental movie, appendix, image, and software files for “Context-Based Evaluation of Dimensionality Reduction Algorithms—Experiments and Statistical Significance Analysis”



      Published In

ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 2
      Survey Paper and Regular Papers
      April 2021
      524 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3446665

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 04 January 2021
      Accepted: 01 October 2020
      Revised: 01 July 2020
      Received: 01 October 2019
      Published in TKDD Volume 15, Issue 2


      Author Tags

      1. Dimensionality reduction
      2. context-based evaluation
      3. quality metrics
      4. statistical significance analysis

      Qualifiers

      • Research-article
      • Research
      • Refereed


      Cited By

• (2024) Dimension Reduction Algorithm of Power and Government-Enterprise Data Fusion Based on Principal Component Analysis. In 2024 International Conference on Power, Electrical Engineering, Electronics and Control (PEEEC), 809-813. DOI: 10.1109/PEEEC63877.2024.00151. Online publication date: 14-Aug-2024.
• (2024) Machine Learning-Based Cellular Traffic Prediction Using Data Reduction Techniques. IEEE Access 12, 58927-58939. DOI: 10.1109/ACCESS.2024.3392624. Online publication date: 2024.
• (2022) Asymmetric Multi-Task Learning with Local Transference. ACM Transactions on Knowledge Discovery from Data 16, 5, 1-30. DOI: 10.1145/3514252. Online publication date: 5-Apr-2022.
• (2022) Who will Win the Data Science Competition? Insights from KDD Cup 2019 and Beyond. ACM Transactions on Knowledge Discovery from Data 16, 5, 1-24. DOI: 10.1145/3511896. Online publication date: 5-Apr-2022.
• (2022) Quality-Informed Process Mining: A Case for Standardised Data Quality Annotations. ACM Transactions on Knowledge Discovery from Data 16, 5, 1-47. DOI: 10.1145/3511707. Online publication date: 5-Apr-2022.
• (2022) On the Robustness of Metric Learning: An Adversarial Perspective. ACM Transactions on Knowledge Discovery from Data 16, 5, 1-25. DOI: 10.1145/3502726. Online publication date: 5-Apr-2022.
• (2022) Application of Machine Learning Techniques to Help in the Feature Selection Related to Hospital Readmissions of Suicidal Behavior. International Journal of Mental Health and Addiction 22, 1, 216-237. DOI: 10.1007/s11469-022-00868-0. Online publication date: 18-Jul-2022.
