Fast t-SNE algorithm with forest of balanced LSH trees and hybrid computation of repulsive forces
Introduction
Data visualization is an important task in machine learning, as people prefer visual representations of data over numerical ones.
We can enumerate simple (in terms of the underlying mathematics) methods like scatter plots, parallel coordinates, histograms, heat maps, survey plots, dimensional stacking, RadViz, etc. [1], [2], but such methods are useful for detecting only some (simple) regularities in data.
One of the most widely known methods of dimensionality reduction is Principal Component Analysis (PCA) [3]. PCA selects the directions of highest variance in the given data and results in a linear, orthogonal transformation. However, such a transformation from a multidimensional space to 2D or 3D will rarely produce readable shapes, because a strong reduction cannot preserve all the initial distances of a high-dimensional space. An advantage of PCA is its low computational complexity, O(nm² + m³), where n is the number of instances and m is the number of attributes. The well-known kernel trick [4] turns linear PCA into nonlinear kernel PCA, an idea proposed by Schölkopf et al. [5].
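The linear projection just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation evaluated in this paper; the helper name `pca_reduce` is hypothetical:

```python
import numpy as np

def pca_reduce(X, k=2):
    """Project X (n instances x m attributes) onto its k leading
    principal components. Building the covariance matrix is O(n m^2),
    its eigendecomposition O(m^3)."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / (len(X) - 1)           # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k highest-variance directions
    return Xc @ top                          # n x k low-dimensional embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = pca_reduce(X, k=2)
```

By construction the first output coordinate carries at least as much variance as the second, which is exactly the property the text attributes to PCA.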
Multidimensional scaling (MDS) was one of the first very practical nonlinear dimensionality reduction methods [6]. The goal of MDS is to preserve distances from a high-dimensional space in a low-dimensional space using the following cost function. Let us assume we have a dataset $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^m$, and let $y_1, \ldots, y_n$ be the points corresponding, respectively, to $x_1, \ldots, x_n$ in the low-dimensional space. Denoting by $\delta_{ij}$ the distance between $x_i$ and $x_j$, MDS minimizes the stress $E = \sum_{i<j} (\delta_{ij} - \|y_i - y_j\|)^2$. Typically, PCA is used to initialize the positions of the low-dimensional points $y_i$. The complexity of MDS is quadratic in $n$: it is composed of PCA, a single $O(n^2 m)$ computation of all pairwise distances in the high-dimensional space, and the $O(n^2)$-per-iteration main loop of MDS.
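A minimal sketch of this stress minimization, assuming plain gradient descent and a small random initialization instead of the PCA initialization mentioned above (an illustration only, not a production MDS solver):

```python
import numpy as np

def mds(D_high, dim=2, lr=0.01, iters=1000, seed=0):
    """Minimize the raw MDS stress sum_{i<j} (delta_ij - ||y_i - y_j||)^2
    by gradient descent. D_high is a precomputed n x n distance matrix."""
    n = len(D_high)
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(n, dim))       # small random init
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]        # pairwise y_i - y_j
        d_low = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(d_low, 1.0)                # avoid division by zero
        coef = (D_high - d_low) / d_low
        np.fill_diagonal(coef, 0.0)
        # gradient of the stress w.r.t. y_i: -2 * sum_j coef_ij (y_i - y_j)
        grad = -2.0 * (coef[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
Y = mds(D, dim=2)
```

Since the target dimensionality here equals the data's intrinsic dimensionality, the recovered pairwise distances should match the input distances closely.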
The Sammon mapping [7] can be seen as a variant of MDS; its goal is similar to that of MDS.
A neural approach called Self-Organizing Maps (SOM) was proposed by Kohonen [8]. SOM creates a map/grid from the original data. Another nonlinear and neural approach is the Neural Gas [9], [10].
The IsoMap [11] algorithm is based on the idea of the geodesic distance. A closely related graph-based approach was proposed in [12]. Both of the above methods are locally oriented, as is another method, Locally Linear Embedding [13].
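The geodesic-distance idea behind IsoMap can be sketched as shortest paths in a k-nearest-neighbor graph. This is a toy O(n³) Floyd–Warshall version for intuition, not IsoMap's actual implementation:

```python
import numpy as np

def geodesic_distances(X, k=5):
    """Approximate geodesic distances as shortest paths in a k-NN graph,
    the core idea behind IsoMap (toy Floyd-Warshall sketch, O(n^3))."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:    # connect each point to its k nearest neighbors
            G[i, j] = G[j, i] = D[i, j]
    for m in range(n):                          # Floyd-Warshall all-pairs shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

theta = np.linspace(0.0, np.pi, 40)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # points along a semicircle
G = geodesic_distances(X, k=2)
```

On the semicircle the geodesic distance between the two endpoints follows the arc (length π), whereas their Euclidean distance is the chord (length 2), which is exactly the distinction IsoMap exploits.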
Stochastic Neighbor Embedding (SNE) was proposed by Hinton and Roweis [14]. SNE and its variants are the main subjects of this article. In [15], the authors described two important improvements. Another type of acceleration in the domain of document analysis was proposed in [16].
One of the newer approaches to dimensionality reduction is LargeVis, proposed by Tang et al. [17]. This algorithm is based on the construction of an approximate K-nearest-neighbor graph. However, while LargeVis is faster than tree-based t-SNE (around two times), it is significantly slower than our algorithm, since LargeVis is also slower than UMAP.
Another recent approach to dimensionality reduction is UMAP [18], which, like t-SNE, uses a cross-entropy cost function. Its speed is compared with our results in Section 4.
For a review of dimensionality reduction methods, see [19], which presents several algorithms together with short descriptions, additional remarks, and a comparison. The review shows that most of these algorithms are strongly time-consuming.
The next section describes the Stochastic Neighbor Embedding and its improvements in more detail. The following section describes several accelerations of t-SNE which exhibit better complexity. Section 4 presents several testing procedures and an analysis of the obtained results, which show the superiority of the proposed methods.
Previous work on t-Distributed stochastic neighbor embedding
First, assume there is a dataset $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^m$ ($n$ is the number of instances and $m$ is the number of attributes). t-Distributed Stochastic Neighbor Embedding (t-SNE) is a successful extension of the work on Stochastic Neighbor Embedding [14]. The description of t-SNE starts by defining two probabilities. First, for the base dissimilarity in the high-dimensional space, $p_{j|i} = \exp(-\|x_i - x_j\|^2 / 2\sigma_i^2) / \sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)$, where $\sigma_i$ is selected by a binary search so that the distribution $P_i$ reaches a fixed perplexity.
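The binary search for σ_i can be sketched as follows. This is a simplified illustration of the standard t-SNE procedure, not the code of this paper; `sigma_for_perplexity` is a hypothetical helper name:

```python
import numpy as np

def sigma_for_perplexity(dist2_i, target_perp, tol=1e-5, max_iter=50):
    """Binary-search the Gaussian bandwidth sigma_i so that the conditional
    distribution p_{j|i} reaches the requested perplexity 2^{H(P_i)}.
    dist2_i holds the squared distances from x_i to all other points."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dist2_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -(p * np.log2(p + 1e-12)).sum()
        perp = 2.0 ** entropy
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:
            hi = sigma    # distribution too flat -> shrink the bandwidth
        else:
            lo = sigma    # distribution too peaked -> widen the bandwidth
    return sigma, p

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
d2 = ((X[1:] - X[0]) ** 2).sum(axis=1)     # squared distances from x_0
sigma, p = sigma_for_perplexity(d2, target_perp=5.0)
```

Monotonicity makes bisection valid here: a larger σ flattens the distribution and raises the perplexity, a smaller σ peaks it and lowers the perplexity.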
Stochastic neighbor embedding with a forest of locality-sensitive hashing trees and hybrid computation of repulsive forces
This section presents three ways of speeding up t-SNE. In consequence, all elements of t-SNE become significantly faster.
- First, a dedicated type of locality-sensitive hashing tree is introduced, used for the computation of attractive forces in t-SNE in place of the Vantage-point tree. This yields much lower computational cost.
- Second, a hybrid computation of repulsive forces is proposed that combines the Barnes–Hut approximation with piecewise polynomial interpolation, also reducing computational cost.
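To give an intuition for the first point, here is a toy sketch of a forest of balanced random-projection (LSH-style) trees for approximate nearest-neighbor search. The paper's dedicated balanced LSH trees are more elaborate; all function names and parameters below are illustrative assumptions:

```python
import numpy as np

def build_tree(X, idx, leaf_size, rng):
    """Recursively split indices at the median projection onto a random
    direction, yielding a balanced random-projection tree."""
    if len(idx) <= leaf_size:
        return ('leaf', idx)
    w = rng.normal(size=X.shape[1])           # random hyperplane direction
    proj = X[idx] @ w
    med = np.median(proj)                     # median split keeps the tree balanced
    left, right = idx[proj <= med], idx[proj > med]
    if len(left) == 0 or len(right) == 0:     # degenerate split -> stop here
        return ('leaf', idx)
    return ('node', w, med,
            build_tree(X, left, leaf_size, rng),
            build_tree(X, right, leaf_size, rng))

def query_tree(tree, x):
    """Descend to the leaf whose region contains x; return candidate indices."""
    while tree[0] == 'node':
        _, w, med, lt, rt = tree
        tree = lt if x @ w <= med else rt
    return tree[1]

def knn_forest(X, q, k, n_trees=8, leaf_size=16, seed=0):
    """Approximate k-NN: union the candidates from a forest of trees,
    then rank them exactly within the candidate set."""
    rng = np.random.default_rng(seed)
    cand = set()
    for _ in range(n_trees):
        tree = build_tree(X, np.arange(len(X)), leaf_size, rng)
        cand.update(query_tree(tree, q).tolist())
    cand = np.array(sorted(cand))
    d = np.linalg.norm(X[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
nbrs = knn_forest(X, X[0], k=5)
```

Using several trees with independent random directions raises the chance that true neighbors share a leaf with the query in at least one tree, which is the usual motivation for an LSH forest.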
Experiments and result analysis
To present a trustworthy comparison of the different versions of the t-SNE algorithm, we use the same benchmark datasets and a similar method of results comparison as presented by van der Maaten [20] and Linderman et al. [23], together with a few additional datasets.
Conclusions
Several important acceleration approaches to the well-known t-SNE algorithm were presented, the best among known variants in terms of both complexity and the accuracy of the resulting dimensionality reduction. First, the LSHF-BH t-SNE algorithm was presented, which is significantly faster than the previous t-SNE variant proposed by van der Maaten [20]. The proposed forest of dedicated balanced LSH trees is much faster than a Vantage-point tree or Annoy for nearest-neighbor computation.
The LSHF-Hybrid t-SNE algorithm was
CRediT authorship contribution statement
Marek Orliński: Conceptualization, Methodology, Software, Data, Writing, Visualization. Norbert Jankowski: Conceptualization, Methodology, Data, Writing, Visualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (40)
- Dimensionality reduction for documents with nearest neighbor queries, Neurocomputing, 2015.
- High-dimensional visualizations.
- Quality metrics in high-dimensional data visualization: An overview and systematization, IEEE Trans. Vis. Comput. Graphics, 2011.
- Analysis of a complex of statistical variables into principal components, J. Educ. Psychol., 1933.
- A training algorithm for optimal margin classifiers.
- Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput., 1998.
- Multidimensional scaling I: Theory and method, Psychometrika, 1952.
- A nonlinear mapping for data structure analysis, IEEE Trans. Comput., 1969.
- Self-Organizing Maps, 1995.
- A growing neural gas network learns topologies.
- A self-organizing network that can follow non-stationary distributions.
- Mapping a manifold of perceptual observations.
- Learning a kernel matrix for nonlinear dimensionality reduction.
- Nonlinear dimensionality reduction by locally linear embedding, Science.
- Stochastic neighbor embedding.
- Visualizing data using t-SNE, J. Mach. Learn. Res.
- UMAP: Uniform manifold approximation and projection for dimension reduction.
- Dimensionality Reduction: A Comparative Review, Tech. Rep. TiCC-TR 2009-005.
- Accelerating t-SNE using tree-based algorithms, J. Mach. Learn. Res.